LNCS 2482 



Roger Dingledine

Paul Syverson (Eds.)



Privacy Enhancing 
Technologies 

Second International Workshop, PET 2002 
San Francisco, CA, USA, April 2002 
Revised Papers 



Springer 





Lecture Notes in Computer Science 2482 

Edited by G. Goos, J. Hartmanis, and J. van Leeuwen




Springer 

Berlin 

Heidelberg 

New York 

Barcelona 

Hong Kong 

London 

Milan 

Paris 

Tokyo 




Roger Dingledine 
Paul Syverson (Eds.)



Privacy Enhancing 
Technologies 



Second International Workshop, PET 2002 
San Francisco, CA, USA, April 14-15, 2002
Revised Papers 




Springer 




Series Editors 



Gerhard Goos, Karlsruhe University, Germany 
Juris Hartmanis, Cornell University, NY, USA 
Jan van Leeuwen, Utrecht University, The Netherlands 

Volume Editors 

Roger Dingledine 

The Free Haven Project

144 Lake St., Arlington, MA 02474, USA 

E-mail: arma@freehaven.net 

Paul Syverson 

Naval Research Laboratory 

Washington DC 20375, USA 



Cataloging-in-Publication Data applied for 

A catalog record for this book is available from the Library of Congress. 

Bibliographic information published by Die Deutsche Bibliothek 

Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; 

detailed bibliographic data is available in the Internet at <http://dnb.ddb.de>. 



CR Subject Classification (1998): E.3, C.2, D.4.6, K.6.5, K.4, H.3, H.4, I.7
ISSN 0302-9743 

ISBN 3-540-00565-X Springer-Verlag Berlin Heidelberg New York



This work is subject to copyright. All rights are reserved, whether the whole or part of the material is 
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, 
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication 
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, 
in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are
liable for prosecution under the German Copyright Law. 

Springer-Verlag Berlin Heidelberg New York
a member of BertelsmannSpringer Science+Business Media GmbH

http://www.springer.de 

© Springer-Verlag Berlin Heidelberg 2003 
Printed in Germany 

Typesetting: Camera-ready by author, data conversion by Olgun Computergrafik 
Printed on acid-free paper SPIN: 10870570 06/3142 5 4 3 2 1 0 




Preface 



The second Privacy Enhancing Technologies workshop (PET 2002), held April 
14-15, 2002 in San Francisco, California, continued the enthusiasm and quality 
of the first workshop in July 2000 (then called “Designing Privacy Enhancing 
Technologies: Design Issues in Anonymity and Unobservability,” LNCS 2009). 

The workshop focused on the design and realization of anonymity and anti-censorship
services for the Internet and other communication networks. For convenience it was held
at the Cathedral Hill Hotel just prior to the Twelfth Conference on Computers, Freedom,
and Privacy (CFP 2002), but it was not formally affiliated with that conference.

There were 48 submissions, of which we accepted 17. The program committee listed on
the next page made the difficult decisions on which papers to accept, with additional
reviewing help from Oliver Berthold, Sebastian Clauß, Lorrie Cranor, Stefan Köpsell,
Heinrich Langos, Nick Mathewson, and Sandra Steinbrecher. Thanks to all who helped and
apologies to anyone we have overlooked. Thanks also to everyone who contributed a
submission.

Besides the contributed papers we were honored to have an invited talk 
by someone who as much as anyone could be said to have started the entire 
technology field of anonymity and privacy: David Chaum. The talk covered his 
recently developed voting technology based on visual secret sharing, and as if 
that weren’t enough, also described new primitives for electronic cash. There was 
also a lively rump session, covering a variety of new and upcoming technologies. 

Adam Shostack served as general chair. The workshop ran remarkably 
smoothly - and in fact ran at all - thanks to Adam, who also personally took 
on the financial risk that we would break even. Thank you Adam. 
Privacy-Enhancing Technologies for the Internet, II

Five Years Later 



Ian Goldberg 

Zero-Knowledge Systems, Inc. 
ian@zeroknowledge.com



Abstract. Five years ago, “Privacy-enhancing technologies for the Internet” [23] 
examined the state of the then newly emerging privacy-enhancing technologies. 
In this survey paper, we look back at the last five years to see what has changed, 
what has stagnated, what has succeeded, what has failed, and why. We also look 
at current trends with a view towards the future. 



1 Introduction 

In 1997, the Internet was exploding. The number of people online was more than doubling
every year, thanks to the popularity of email and the World Wide Web. But more and 
more of these people came to realize that anything they say or do online could potentially 
be logged, archived, and searched. It turns out this was not simply an idle fear; today, 
the WayBack Machine [30] offers archives of the World Wide Web back to 1996, and 
Google Groups [26] offers archives of Usenet newsgroups back to 1981! Even in 1996,
the cartoon “Doctor Fun” recognized this problem enough to joke, “Suddenly, just as 
Paul was about to clinch the job interview, he received a visit from the Ghost of Usenet 
Postings Past.” [17]. 

So-called privacy-enhancing technologies were developed in order to provide some 
protection for these visitors to cyberspace. These technologies aimed to allow users to 
keep their identities hidden when sending email, posting to newsgroups, browsing the 
Web, or making payments online. 

The 1997 paper “Privacy-enhancing technologies for the Internet” [23] surveyed the 
landscape of past and then-current privacy-enhancing technologies, as well as discussing 
some promising future candidates. The need for privacy has not diminished since then; 
more people are still getting online and are being exposed to privacy risks. Identity theft 
[48] is becoming a bigger problem, and we are even seeing explicitly privacy-degrading 
technologies being deployed by companies like Predictive Networks and Microsoft, who 
aim to track consumers’ television-watching habits [13]. The need for privacy is still very
real, and technology is our main tool to achieve it. 

In this paper, we take a second look around the privacy-enhancing technology land- 
scape. Although there are many such technologies, both in the offline and online worlds, 
we focus our attention on technologies aimed at protecting Internet users, and even then, 
we primarily discuss only technologies which have seen some amount of deployment. 
In our look around, we see some parts that appear just the same as five years ago, often 
surprisingly so; we see some parts that are new, but not where we expected them. Finally,
we take a stab at future directions, perhaps setting ourselves up for a sequel in another 
five years. 

2 What Was 

In this section, we recap the state of the world in 1997. For more detail, the interested 
reader is referred to the original paper [23]. 

2.1 What Was Well-Established 

In 1997, anonymous remailers for electronic mail were established technology. At first, 
there was the original “strip-headers-and-resend” style of remailer (also known as a “type 
0” remailer), the best-known example of which was anon.penet.fi. The penet remailer
also allowed for replies to anonymous posts: when you sent your first message through
the system, you were assigned a fake address at the anon.penet.fi domain as a pseudonym
(or “nym”). The headers on your original email would then get re-written to appear to 
come from that nym. Replies to that nym would cause the remailer to look up your real 
email address in a table it kept, and the reply message would be forwarded back to you. 
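
The nym table described above can be sketched in a few lines. This is a toy model only, not the actual penet software: the class name, the `anon.example` domain, and the `anN` nym format are hypothetical choices made for illustration.

```python
import itertools


class Type0Remailer:
    """Toy strip-headers-and-resend remailer with a nym table (illustrative only)."""

    def __init__(self, domain="anon.example"):  # hypothetical domain
        self.domain = domain
        self._counter = itertools.count(1)
        self.nym_of = {}   # real address -> nym
        self.real_of = {}  # nym -> real address: the table an operator can be forced to reveal

    def _nym_for(self, real_addr):
        # Assign a nym the first time an address sends through the system.
        if real_addr not in self.nym_of:
            nym = f"an{next(self._counter)}@{self.domain}"
            self.nym_of[real_addr] = nym
            self.real_of[nym] = real_addr
        return self.nym_of[real_addr]

    def send(self, sender, recipient, body):
        """Rewrite the headers so the message appears to come from the nym."""
        return {"From": self._nym_for(sender), "To": recipient, "Body": body}

    def reply(self, nym, body):
        """Look up the real address behind the nym and forward the reply."""
        return {"To": self.real_of[nym], "Body": body}
```

The `real_of` dictionary is the single point of failure: whoever holds it, or can compel its disclosure, can map every nym back to a real address.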

Unfortunately, in 1996, legal pressure forced the operator, Johan Helsingius, to reveal
user addresses stored in the table of nyms. In order to prevent further addresses from
being forced to be divulged, Helsingius shut down the widely used remailer completely
[28].

To combat the problems of a single operator being able to lift the veil of anonymity, 
either because he was a bad actor, or because he was able to be coerced, a new style 
of remailer was developed. These were called “type I” or “cypherpunk-style” remailers
[11]. To use type I remailers, a user sends his message not via a single remailer, as
with type 0, but rather, selects a chain of remailers, and arranges that his message be
successively delivered to each remailer in the chain before finally arriving at the mail’s
intended destination.

These type I remailers also supported PGP encryption, so that each remailer in the 
chain could only see the address of the next remailer, and not the ones further down, or 
that of the final recipient (or even the body of the message). Only the last remailer in
the chain (called the “exit node”) could see the address of the recipient, and the body
of the message. 
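
The nesting of encrypted layers can be illustrated with a small sketch. It is not how PGP works internally: the per-remailer public-key encryption is replaced here by a toy symmetric XOR keystream (`toy_cipher`) that is not secure, and the function names and JSON layer format are assumptions made purely for illustration.

```python
import hashlib
import json


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR with a SHA-256-derived keystream. Toy stand-in for PGP; NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def wrap(chain, keys, recipient, body: bytes) -> bytes:
    """Build the nested message from the innermost layer outwards."""
    next_hop, payload = recipient, body
    for name in reversed(chain):
        # Each layer names only the hop that follows this remailer.
        layer = json.dumps({"next": next_hop, "data": payload.hex()}).encode()
        payload = toy_cipher(keys[name], layer)
        next_hop = name
    return payload  # hand this to chain[0]


def peel(key: bytes, sealed: bytes):
    """What one remailer does: remove its own layer and learn only the next hop."""
    layer = json.loads(toy_cipher(key, sealed))
    return layer["next"], bytes.fromhex(layer["data"])
```

In this sketch only the holder of the last key in the chain ever sees the recipient address and the message body; every earlier remailer sees just the name of its successor and an opaque blob.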

2.2 What Was Current 

Type I remailers had always had some issues with security; for example, an attacker who 
could watch messages travel through the remailer network could easily trace messages 
from their destination back to their source if they were not encrypted, and even if they 
were, he could do the same simply by examining the sizes of the encrypted messages. 

In order to fix this and other security problems with the type I remailers, “type
II”, or “Mixmaster” remailers were developed [45]. Mixmaster remailers always used 
chaining and encryption, and moreover, broke each message into a number of fixed-size 
packets, and transmitted each packet separately through the Mixmaster chain. The exit 
node recombined the pieces, and sent the result to the intended recipient. This added to
the security of the system, but at the cost of requiring special software to send email 
through the Mixmaster network (but no special software was required to receive email). 
In contrast, a type I message could be constructed “by hand” in a straightforward manner. 
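
A minimal sketch of the fixed-size packet idea follows. The packet size, the 4-byte length prefix, and the zero padding are assumptions for illustration and do not reflect the actual Mixmaster packet format; encryption and per-packet routing are omitted.

```python
PACKET_SIZE = 64  # hypothetical size, chosen only for illustration


def split_into_packets(message: bytes, size: int = PACKET_SIZE):
    """Length-prefix the message, pad with zeros, and cut into equal-size packets."""
    framed = len(message).to_bytes(4, "big") + message
    framed += b"\x00" * (-len(framed) % size)  # pad up to a multiple of the packet size
    return [framed[i:i + size] for i in range(0, len(framed), size)]


def recombine(packets):
    """What the exit node does: join the packets in order and strip the padding."""
    framed = b"".join(packets)
    length = int.from_bytes(framed[:4], "big")
    return framed[4:4 + length]
```

Because every packet has the same length, an observer can no longer match messages entering and leaving a remailer by size alone. A real system would also need sequence information so that packets arriving out of order can be reassembled, which this sketch leaves out.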

Type II remailers provided an excellent mechanism for sending email without reveal- 
ing your identity. The most popular technique for securely arranging to receive email 
was the use of the “newnym” style nymserver [36]. This technology allowed you to pick 
a pseudonymous address at the nym.alias.net domain, and have that address associated 
to a “reply block”, which is a multiply-encrypted nested chain of addresses, much in 
the style of a type I remailer message. The remailer network, when presented a message 
and a reply block, would forward the message along each step in the chain, eventually 
causing it to reach the owner of the pseudonym. 

Another technique for receiving messages was the use of message pools. Simply 
arrange that the message be encrypted and posted to a widely distributed Usenet newsgroup
such as alt.anonymous.messages. Since everyone gets a copy of every message,
it’s not easy to tell who’s reading what. 

Using a combination of the above techniques, anonymous and pseudonymous email 
delivery was basically a solved problem, at least from a technical point of view. 

Attention turned to other directions; email is not the only interesting technology 
on the Internet. The next obvious choice was the World Wide Web. In 1997, the state 
of the art was roughly equivalent to the technology of type 0 remailers for email. The 
Anonymizer [2] was (and still is) a web proxy you can use to hide your IP address and 
some other personal information from the web sites you visit. Your web requests go from 
your machine, to the Anonymizer, to the web server you’re interested in. Similarly, the 
web pages come back to you via the Anonymizer. 

Finally, technology was being rolled out in 1997 for the use of anonymous digital 
cash. The promise of being able to pay for things online in a private fashion was enticing, 
especially at a time when consumers were still being frightened away from using their 
credit cards over the Internet. There were a number of companies rolling out online 
payment technologies; the most privacy-friendly technology involved was invented by 
David Chaum and was being commercialized by a company called DigiCash [9]. 

2.3 What Was Coming 

The short-term horizon in 1997 had a number of special-purpose projects being studied. 
Ross Anderson’s “Eternity Service” [1] (later implemented in a simpler form by Adam 
Back as “Usenet Eternity” [4]) promised the ability to publish documents that were 
uncensorable. In Back’s implementation, the distributed nature of Usenet was leveraged 
to provide the redundancy and resiliency required to ward off attempts to “unpublish”
information. 

Perhaps the most promising upcoming technology in 1997 was Wei Dai’s proposal 
for “PipeNet”: a service analogous to the remailer network, but designed to provide 
anonymity protection for real-time communication, such as web traffic, interactive chats, 
and remote login sessions [12]. The additional difficulties imposed by the real-time
requirement required significant additions in complexity over the remailer network. 
However, the range of new functionality potentially available from such a system would 
be considerable. For example, without a PipeNet-like system, anonymous digital cash
isn’t very useful: it would be like sending an envelope with cash through the mail, but 
putting your (real) return address on the envelope. 

Unfortunately, PipeNet was never developed past the initial design stages. Onion 
Routing [24] was another project that was starting to get deployed in 1997, and which 
was attempting to accomplish similar goals as PipeNet. Onion Routing, however, elected 
to trade off more towards performance and robustness; in contrast, PipeNet chose security 
and privacy above all else, to the extent that it preferred to shut down the entire network
if the alternative was to leak a bit of private information. 



3 What Happened Since 

So that was 1997. It’s now 2002. What changes have we seen in the privacy-enhancing 
technology landscape in the last five years? 

The widespread consumer acceptance of the World Wide Web has led to further 
research into privacy protection in that space. An example is Crowds [43], an AT&T 
project which aims to apply the principles of type I remailers to the World Wide Web. 
The tag line for the project is “Anonymity Loves Company”. The principle is that the 
set of people utilizing this system forms a crowd. A web request made by any member 
of the crowd is either submitted to the web server in question (as would be the usual 
case for web surfing without the Crowds system), or else sent to another member of the 
crowd. (The choice is made randomly.) If it is sent to another member, that member again 
randomly decides whether to submit the request or to pass it off to another member, and 
so on. Eventually the request makes it to the web server, and the response is handed off 
down the chain of requesting members until it reaches the member who originated the 
request. 
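
The hop-by-hop random forwarding can be sketched as follows. The text says only that the choice is made randomly, so the forwarding probability `forward_prob` and its default value are assumptions, as are the function name and the rule that a member never forwards to itself.

```python
import random


def crowds_path(initiator, members, forward_prob=0.75, rng=random):
    """Return the crowd members a request visits before it is submitted.

    At each step the current holder either submits the request to the web
    server or forwards it to another randomly chosen member.
    Assumes the crowd has at least two members.
    """
    path = [initiator]
    while rng.random() < forward_prob:
        others = [m for m in members if m != path[-1]]
        path.append(rng.choice(others))
    return path  # path[-1] submits the request; the response retraces the path in reverse
```

From the web server's point of view the request comes from `path[-1]`, which may or may not be the initiator. Under these assumptions the number of forwards is geometrically distributed with mean `forward_prob / (1 - forward_prob)`, which is 3 for the default value.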

The idea is similar to that of chaining used in type I remailers, but with a couple of
notable differences. First, unlike in the remailer case, the chain used is not selected by 
the user, but is instead randomly generated at a hop-by-hop level. Also, cryptography is 
not used to protect the inter-member communications. This reflects the different threat 
model used by Crowds: it is only trying to provide plausible deniability against the 
web server logs compiled by the site operator; for example, if it is observed that your 
machine made a web request for information about AIDS drugs, it is known only that 
some member of the worldwide crowd requested that information, not that it was you. 
Crowds makes no attempt to thwart attackers able to sniff packets across the Internet. 

A recent German project, JAP (Java Anonymous Proxy), aims to protect against a 
larger class of threats, though still only with respect to protecting the privacy of people 
browsing the Web [21]. JAP applies the ideas of type II remailers to web surfing; requests 
and responses are broken up into constant-size packets, encrypted, and routed through 
multiple intermediate nodes, called mixes. Each mix waits for a number of packets 
to arrive, then decrypts one layer from each, and sends them on their way in a single 
randomly-ordered batch. 
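
The batching behaviour of a single mix can be sketched like this. It is an illustration of the idea only, not JAP's implementation: the class name, the fixed `batch_size` threshold, and the `strip_layer` callback (standing in for decrypting one layer) are all assumptions.

```python
import random


class Mix:
    """Toy batching mix: collect packets, strip one layer from each, flush shuffled."""

    def __init__(self, batch_size, strip_layer, rng=random):
        self.batch_size = batch_size
        self.strip_layer = strip_layer  # callable: packet -> packet with one layer removed
        self.rng = rng
        self.pool = []

    def receive(self, packet):
        """Queue a packet; return a reordered batch once enough packets have arrived."""
        self.pool.append(packet)
        if len(self.pool) < self.batch_size:
            return []  # keep waiting, nothing leaves the mix yet
        batch = [self.strip_layer(p) for p in self.pool]
        self.pool = []
        self.rng.shuffle(batch)  # output order is unrelated to arrival order
        return batch
```

Waiting for a full batch and shuffling it means that arrival order tells an observer nothing about which outgoing packet corresponds to which incoming one; the constant packet size removes the size clue as well.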

Privacy when browsing content on the Web is not the only important consideration; 
some information is important to distribute, yet may get the distributors in trouble with 
local authorities or censors. That some information is deemed by a local government 
somewhere in the world as unsuitable should not mean the provider should be forced
to remove it entirely. Server-protecting privacy systems allow for the publishing of 
information online without being forced to reveal the provider’s identity or even IP 
address. 

It should be noted that it is usually insufficient to simply put the information on a 
public web hosting service, because the provider of that service will simply remove the 
offending material upon request; it has very little incentive not to comply. 

The aforementioned Eternity service was a first attempt to solve this problem. More 
recently, projects such as Free Haven [14], FreeNet [10], and Publius [49] aimed for 
similar goals. With Publius, documents are encrypted and replicated across many servers. 
The decryption keys are split using a secret-sharing scheme [46] and distributed to the 
servers. A special URL is constructed that contains enough information to retrieve the
encrypted document, find the shares of the key, reconstruct the decryption key, and 
decrypt the document. 
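
To make the key-splitting step concrete, here is a minimal sketch of a threshold secret-sharing scheme. It assumes a Shamir-style polynomial scheme over a prime field; the text says only that "a secret-sharing scheme" is used, and the field size, function names, and parameters below are illustrative choices, not Publius's actual parameters.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret: int, threshold: int, shares: int):
    """Return `shares` points on a random polynomial f of degree threshold-1 with f(0) = secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def f(x):
        y = 0
        for c in reversed(coeffs):  # Horner's rule
            y = (y * x + c) % PRIME
        return y

    return [(x, f(x)) for x in range(1, shares + 1)]


def recover_secret(points):
    """Lagrange interpolation at x = 0 over the prime field (needs Python 3.8+ for pow(x, -1, p))."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `split_secret(key, 3, 5)`, one share goes to each of five servers and any three of them suffice to rebuild the decryption key, so the document stays retrievable even if some servers disappear.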

Publius cryptographically protects documents from modification, and the distributed 
nature attempts to ensure long-term availability. In addition, the encrypted nature of the 
documents provides for deniability, making it less likely that the operators of the Publius 
servers would be held responsible for providing information they have no way to read. 

With Free Haven, the aim is also to provide content in such a manner that adversaries 
would find it difficult to remove. Free Haven also provided better anonymity features to 
publishers than did Publius. 

More ambitiously, a couple of projects aimed to implement systems somewhat along 
the lines of PipeNet. As mentioned above, the Naval Research Lab’s Onion Routing
[24, 25] provided (for a while) more general anonymity and pseudonymity services, 
for applications other than simply web browsing; services such as remote logins and 
interactive chat were also supported. IP packets were forwarded between nodes situated 
around the Internet. 

A little later, Zero-Knowledge Systems’ Freedom Network [6] rolled out another
PipeNet-inspired project, this one as a commercial venture. Eventually, it was the large 
infrastructure requirement that was to be the Freedom Network’s downfall. Whereas the 
nodes in the remailer network are all run by volunteers, the Freedom Network was a 
commercial venture, and there were non-trivial costs associated with operating (or paying 
people or organizations to operate) the nodes. There were costs associated with network 
management and nym management. In addition, some defenses against traffic analysis 
(such as link padding) use an exorbitant amount of bandwidth, which is particularly 
expensive in some parts of the world. Finally, if users are paying for a service, they 
expect high-quality performance and availability, which are expensive to provide, and 
were not required in the free, volunteer remailer network. 

Zero-Knowledge Systems simply could not attract enough of a paying customer base 
to support the overhead costs of running a high-quality network. And they were not the 
only ones; other commercial ventures which operated large-scale infrastructure, such as 
Safe Web [44], suffered the same fate. 

The lackluster acceptance of electronic cash could be attributed to similar causes. In 
the last five years, we have seen many protocols for online and offline electronic payment 
systems, with varying privacy properties (for example, [9], [7], [40]). The protocols are 
there, and have been around for some time. But no company has successfully deployed electronic cash
to date. Why is that? For one thing, electronic cash is really only useful when it is widely 
accepted. Furthermore, in order for it to interoperate with the “real” money system, 
financial institutions need to be involved. This can have enormous infrastructure costs, 
which will be challenging to recoup. 

Note that one could construct a “closed” ecash-like system, where there is no ex- 
change of value between the ecash system and the rest of the world, which does not 
have this problem. A slightly more general technology is called “private credentials” 
[7], in which the holder of a credential can prove quite complicated assertions about his 
credential without revealing extra information. For example, you could prove that you 
were either over 65 or disabled (and thus entitled to some benefit), without even revealing
which of the two was the case, and certainly without revealing other identifying
information such as your name. Electronic cash can be seen to simply be a special case 
of this technology, wherein the credential says “this credential is worth $1 and may only 
be used once”. 

Private credentials are highly applicable to the realm of authorization, which is im- 
portant to distinguish from authentication. With an authentication system, you prove 
your identity to some entity, which then looks you up in some table (for example, an 
access control list), and decides whether you’re allowed to access whatever service. On 
the other hand, with an authorization scheme, you simply directly prove that you are 
authorized to access the service, and never reveal your identity. Authorization schemes 
allow for much more privacy-friendly mechanisms for solving a variety of problems. 
Juggling many such authorizations, however, can lead to a non-trivial trust manage- 
ment problem. Systems such as KeyNote [5] allowed one to make decisions based on 
authorizations for keys, as opposed to authentication of people. 
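
The difference between the two approaches can be shown with a deliberately simple sketch. It is not a private credential system and not KeyNote: the HMAC bearer token below is a much weaker stand-in (it is linkable across uses and can be copied), and the key, access list, and function names are hypothetical.

```python
import hashlib
import hmac

ISSUER_KEY = b"demo-issuer-key"       # hypothetical key shared by issuer and service
ACCESS_LIST = {"alice": {"library"}}  # hypothetical access control list


def authenticate_then_check(identity: str, service: str) -> bool:
    """Authentication: the service learns who you are, then looks you up in a table."""
    return service in ACCESS_LIST.get(identity, set())


def issue_token(service: str) -> bytes:
    """The issuer vouches that the bearer may use `service`; no identity is embedded."""
    return hmac.new(ISSUER_KEY, service.encode(), hashlib.sha256).digest()


def check_authorization(service: str, token: bytes) -> bool:
    """Authorization: the service verifies the right itself and never sees an identity."""
    return hmac.compare_digest(token, issue_token(service))
```

In the first function the service must be told the identity before it can decide; in the second it only checks that the presented token is valid for the requested service.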

So far, almost every commercial privacy technology venture has failed, with Anony- 
mizer.com [2] being a notable exception. Originally hosting the Anonymizer (see above), 
Anonymizer.com also offers services including email and newsgroup access, as well 
as dial-up Internet access. Compared to other infrastructure-heavy attempts, Anony- 
mizer.com has a relatively simple architecture, at the expense of protecting against a 
weaker threat model. But it seems that that weaker threat model is sufficient for most 
consumers, and we are starting to see other companies similarly relaxing their threat 
models [51]. 

Why is deploying privacy-enhancing technologies so difficult? One large problem 
is that, generally, these technologies are not simply software products that an end user 
can download and run, and in so doing, gain some immediate privacy benefit. Rather, 
there is often some infrastructure needed to support aggregation of users into anonymity 
groups; not only does this add to the cost of deployment, but users in this case only really 
accrue privacy benefits once a large number of them have bought into the system. 

We can divide privacy-enhancing technologies into four broad categories, roughly 
in increasing order of difficulty of deployment: 

Single party: These are products, such as spam and ad blockers, and enterprise privacy 
management systems, that can in fact be installed and run by a single party, and 
do not rely on some external service, or other users of the system, in order to be 
effective. 
Centralized intermediary: These technologies are run as intermediary services. An
intermediary maintains a server (usually a proxy of some sort) that, for example,
aggregates client requests. Deploying and maintaining such a server is relatively
easy, but if it goes away, the customers lose their privacy advantage. The Anonymizer
and anon.penet.fi are examples of technologies in this category. 

Distributed intermediary: The technologies in this category, such as the remailer net- 
work, Crowds, and the Freedom Network, rely on the cooperation of many distinct 
intermediaries. They can be made more robust in the face of the failure of any one in- 
termediary, but the cost involved to coordinate and/or incentivize the intermediaries 
to cooperate may be quite large. 

Server support required: This last category contains technologies that require the co- 
operation of not just a single or a handful of intermediaries, but rather that of every 
server with which the user wishes to perform a private transaction. An example of a 
technology in this class is private electronic cash, where every shop at which a user
hopes to spend his ecash needs to be set up in advance with the ability to accept it. 

In general, technologies whose usefulness relies on the involvement of greater num- 
bers of entities, especially when non-trivial infrastructure costs are involved, will be 
more difficult to deploy. 



4 What May Be Coming 

4.1 Peer-to-Peer Networks and Reputation 

How do we address this problem of deploying expensive infrastructure? The remailer 
network does it with volunteers; can we expand on that idea? Perhaps we can take a 
page from the peer-to-peer (p2p) playbook. If a good amount of the infrastructure in the 
system can be provided by the users of the system themselves (as in the Crowds project, 
for example), we reduce not only the cost to the organization providing the service, but, 
in the extreme case, the entire reliance on the existence of the organization itself, making 
the end users supply the pieces of the infrastructure. A p2p technology builds right in
the idea of distributing trust instead of centralizing it. By removing any central target, it 
provides more resistance against censorship or “unpublishing” attacks. 

Peer-to-peer systems are a natural place to put privacy-enhancing technologies for 
another reason, as well: the most common use of p2p networks today is for file sharing.
As was seen in the case of Napster [3], although users really enjoy sharing music and
other files over the Internet, most p2p protocols do not have any sort of privacy built into
them. Users sharing particular files can be, and have been, tracked or identified. Adding
privacy technology to a p2p network provides obvious advantage to the user, as well as 
providing a useful service. 

Another problem with today’s p2p networks is that anyone can respond to a request 
incorrectly. There exist programs for the Gnutella network [22], for example, that will 
respond to any request with a file of your choice (probably advertising). As p2p networks
grow, combatting this problem will become important. One solution interacts well with
privacy-enhancing technologies; that is the use of reputation. A collaborative reputation 
calculation can suggest the trustworthiness of a user, whether that user is completely
identified or pseudonymous.
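
As a rough illustration of a reputation calculation that works on pseudonyms, consider the sketch below. It is a toy averaging scheme, not the mechanism used by any of the systems cited in this paper; the class name, the +1/-1 scoring, and the one-vote-per-rater rule are assumptions.

```python
from collections import defaultdict


class ReputationBoard:
    """Toy collaborative reputation keyed by pseudonym rather than real identity."""

    def __init__(self):
        self.ratings = defaultdict(dict)  # subject nym -> {rater nym: score}

    def rate(self, rater, subject, score):
        """Record a rater's latest opinion: +1 for a good response, -1 for a bogus one."""
        if score not in (1, -1):
            raise ValueError("score must be +1 or -1")
        self.ratings[subject][rater] = score  # one vote per rater; re-rating overwrites

    def reputation(self, subject):
        """Mean score in [-1, 1]; 0.0 for a nym that nobody has rated."""
        votes = self.ratings.get(subject, {})
        return sum(votes.values()) / len(votes) if votes else 0.0
```

Nothing in the calculation needs a real name: a peer that keeps answering requests with advertising accumulates a poor score under whatever nym it uses. A real design would also have to cope with raters who lie and with peers that discard a bad nym and start over, which this sketch ignores.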

We are starting to see reputation systems deployed today, in such communities as 
Ebay [15], Slashdot [39], and Advogato [35]. As more research is done in this area, 
combining this work with privacy-enhanced peer-to-peer networks, in a manner such as 
begun by Free Haven [14], is a natural step. 

4.2 Privacy of Identity vs. Privacy of PII

Most privacy-enhancing technologies to date have been concerned with privacy of iden- 
tity; that is, the controlling of the distribution of information about who you are. But 
there are many other kinds of information about yourself that you might want to control. 
Personally identifiable information, or PII, is any information that could be used to give 
a hint about your identity, from your credit card number, to your ZIP code, to your 
favourite brand of turkey sausage. 

Consumers are starting to get concerned about the amount of PII that is collected 
about them, and are looking for ways to maintain some control over it. A number of 
technologies allow the management of web-based advertisements or HTTP cookies 
[33], for example. Technologies such as Junkbuster [32] and P3P [42] allow the user to 
control what ads they see and what cookies they store. P3P even allows the choice to be 
made based on the website’s stated privacy practices, such as whether the user is able to 
opt out of the PII collection. Private credential technologies such as Brands’ [7] allow 
the user to prove things about himself without revealing extra personal information. 

Sometimes, however, it is not an option to prevent the collection of the information; 
some kinds of PII are required in order to deliver the service you want. For example, on- 
line retailers need your delivery address and payment information; health care providers 
need your medical history; people paying you money need your SSN¹. In response, a
number of industry players, for example [29, 37, 47, 50], are rolling out products that: 

- help consumers manage to whom they give access to their personal information, and

- help organizations that collect said information keep control over it and manage it
  according to their stated privacy policies.

This “enterprise-based privacy” aims to provide technology for protecting data that
has already been collected, as opposed to preventing the collection in the first place. 

However, whereas the consumer obviously has an interest in keeping his personal 
information private, what incentive does an organization have to do the same? In addition 
to better customer relationships, organizations which collect personal data today often 
have to comply with various sorts of privacy legislation, which we will discuss next. 

4.3 Technology vs. Legislation 

In recent years, we have seen an escalating trend in various jurisdictions to codify privacy 
rules into local law. Laws such as PIPEDA [41] in Canada, COPPA, HIPAA, and the
GLB Act [19, 27, 20] in the US, and the Data Protection Directive [16] in the EU aim to
allow organizations misusing personal data to be penalized. The German Teleservices
Data Protection Act [8] even requires providers to offer anonymous and pseudonymous
use and payment services, and prohibits user profiles unless they are pseudonymous.

¹ Sometimes, “need” is a strong word. Although, for example, there are ways to make payments
online and arrange deliveries without using your credit card number or physical address, it’s
unlikely the company you’re dealing with will go through the trouble of setting up support for
such a thing.

This is an interesting development, since many in the technology community have 
long said that the security of one’s transactions should be protected by technology, and not 
by legislation. For example, technologists have often criticized the cellphone industry for 
spending money lobbying the government to make scanning cellphone frequencies and 
cloning phones illegal rather than implementing encryption that would render it difficult, 
if not impossible. While from a financial point of view, the cellphone companies clearly 
made the correct decision, the result is that criminals who don’t care that they’re breaking 
an additional law still listen in on phone calls, and sell cloned cellphones, and people’s 
conversations and phone bills are not in any way more secure. 

What has changed? Why are we now embracing legislation, sometimes without 
technology to back it up at all? 

The reason lies in the differing natures of security and privacy. In a privacy-related 
situation, you generally have a pre-established business relationship with some organi- 
zation with whom you share your personal data. An organization wishing to misuse that 
data is discouraged by the stick of Law. 

On the other hand, in a security-related situation, some random eavesdropper is plucking
your personal information off of the airwaves or the Internet. You usually don’t have
someone to sue or to charge. You really need to prevent them from getting the data in the 
first place, likely through technological means. In the privacy case, you don’t want to 
prevent your health care provider from getting access to your medical history; you just 
don’t want them to share that information with others, and as we know from the world 
of online filesharing [38, 22, 18], using technology to prevent people from sharing data 
they have access to is a non-trivial problem^. 

With traditional privacy-enhancing technologies, the onus was entirely on the user 
to use whatever technology was available in order to protect himself. Today, there are 
other parties which need to be involved in this protection, since they store some of your 
sensitive information. Legislation, as well as other social constructs, such as contracts, 
help ensure that these other parties live up to their roles. 

So with or without technology to back it up, legislation really is more useful in the
privacy arena than in the security field. Of course, it never hurts to have both; for example, 
the enterprise-based technologies mentioned above can be of great assistance in ensuring 
compliance with, and enforcement of, relevant legislation. In particular, now more than 
ever, technologists need to remain aware of the interplay between their technology and 
the changing legislative environment [34].



^ Some people have (sometimes half-jokingly) suggested that Digital Rights Management [31] 
techniques from the online music arena could be flipped on their heads to help us out here; a 
consumer would protect his personal data using a DRM technique, so that it could be used only 
in the ways he permits, and could not be passed from his health care provider to his health food 
salesman. 
5 Conclusion

The last five years have been hard for privacy-enhancing technologies. We have seen 
several technologies come and go, and have witnessed the difficulty of deploying systems 
that rely on widespread infrastructure. The technical tools that remain at our disposal 
are somewhat weak, and we are unable to achieve bulletproof technological protection. 

Luckily, many applications do not require such strength from technology; many ap- 
plications of privacy are social problems, and not technical ones, and can be addressed by 
social means. When you need to share your health care information with your insurance 
provider, you cannot use technology to prevent it from distributing that information; 
social constructs such as contracts and legislation really can help out in situations like 
those. 

In closing, what strikes us most about the changes in privacy-enhancing technologies
over the last five years is that very little technological change has occurred at all, especially
in the ways we expected. Instead, what we see is an increased use of combinations
of social and technological constructs. These combinations recognize the fact that the 
desired end result is not in fact the technological issue of keeping information hidden, 
but rather the social goal of improving our lives. 
Adil Alsaid and David Martin

Abstract. This article is a case study in the design, implementation,
and deployment of the Bugnosis privacy enhancing tool. Downloaded and 
installed by over 100,000 users to date, Bugnosis contributes to network 
privacy indirectly — without any technical protection measures such as 
filtering or anonymization — by raising awareness about Web bugs and 
arming users with specific information about current Web site practices. 



1 Introduction 

Bugnosis is an add-on for Internet Explorer that detects Web bugs during Web 
surfing and alerts the browser operator to their presence. In this article we 
highlight the most interesting questions and challenges we faced as Bugnosis 
grew from a summer afternoon exercise into an independent study project and 
then a full-blown distribution. 

We define the Bugnosis notion of Web bugs in section 3.1. Briefly, Web bugs 
are invisible third party images added to a Web page so that the third party 
receives notice of the page viewing event. (The name “bug” comes from the term 
“electronic bug,” which is slang for a tiny, hidden microphone. It has nothing to 
do with a “bug in a program.” Web bugs are also variously called Web beacons, 
pixel tags, and clear GIFs.) A third-party cookie is often associated with the 
Web bug transaction, which may allow the third party to recognize the user’s 
Web browser uniquely. Web bugs are controversial because not only are they 
often used to gather information about user behavior without user consent, but 
they also trick Web browsers into assisting with this surveillance by claiming 
that an image is required as part of a Web page when in fact the image has no 
content benefiting the user.

Bugnosis’ operation is straightforward. During installation, it attaches itself
to Internet Explorer so that it is automatically invoked whenever IE runs. Bugnosis
keeps a low profile while the user browses the Web, but it silently examines
each viewed Web page for the presence of Web bugs. If it finds a Web bug, it
alerts the user with a cute sound (“uh-oh!”) and displays information about the
Web bug in a separate pane.

* This work was supported by the Privacy Foundation, the University of Denver, and
Boston University. Most of this work was done at the University of Denver Computer
Science Department.

Fig. 1. Screen shot of Bugnosis announcing its discovery of a Web bug. The areas
labeled #1, #2, and #3 identify information provided by Bugnosis.

Figure 1 shows an Internet Explorer window after Bugnosis has identified 
a Web bug. In addition to the usual IE paraphernalia, three new areas on the 
screen are visible and indicated by number. Area #1 is the toolbar Bugnosis 
control, consisting of a drop-down control menu, a bug-shaped pane visibility 
toggle switch, a severity meter, and text information about the page. The severity 
meter progresses from yellow to red. In the original color image, the first three 
severity bars are lit up, making this a “medium severity” Web bug. The text 
shows that Bugnosis has analyzed 17 images on the page, found one Web bug, 
and no merely “suspicious” images. Until a Web bug has been discovered, this 
toolbar line is the only indication that Bugnosis is present. (Bugnosis 1.1, the 
currently available version, displays only the visibility toggle switch.) 

When Bugnosis discovered the Web bug on the page, it replaced the normally 
invisible pixels of the bug image with the 18x18 pixel cartoonish image of a bug 
labeled #2 in the middle area of the screen. On-screen, the bug flashes mildly
and gallops along so that the user can spot it easily. 

The bottom pane on the screen, labeled #3, shows details about the Web bug 
itself, such as the URL of the image, the technical properties that lead Bugnosis 
to consider it a Web bug, and clickable icons. The information here is arranged
in a table, but since only one Web bug was discovered, only one row is visible. 

2 Philosophy and Strategy 

Bugnosis is designed to recognize invisible images served from 3rd-party servers 
as Web bugs, while excluding those images that seem to be used only for align- 
ment purposes. See sections 3.1 and 3.2 for detailed definitions of Web bugs. 

The fascinating thing about Web bugs is that they allow coherent discussion 
of the well known third party tracking threat: namely, the observation that a 
cookie-laden third party image provider with wide Internet coverage is capable 
of gathering very detailed clickstream records about a user population. In the 
case of Internet advertising firms, this discussion about user tracking is clouded 
by the fact that the firms’ objectives may be totally fulfilled by displaying an 
advertising image in a Web browser. Although the firms may be interested in 
tracking their user population, the delivery of a third party image is not enough 
to confirm or deny this suspicion. But Web bugs, by definition, are used only to 
gather information about users. The question is no longer “are they gathering 
information?” but, by narrowing our interest to Web bug providers, “why are 
they gathering information?” and “what will they do with it?” 

2.1 Audience 

Our intended audience for Bugnosis is neither whistle-blowers, nor the politically 
oppressed, nor citizens of states with challenging privacy histories, nor overly 
scrutinized corporate employees — none of the traditional audience for privacy 
enhancing tools. Our true audience is journalists and policy makers. This is 
why: the cookie-based third party tracking threat is just too hard to explain to 
people. To get from the concept of “a tiny file we store on your computer” all the 
way to “the ability to track you between Web sites” requires a lot of otherwise 
irrelevant discussion about Internet topology, HTTP, browser settings, database 
index keys, and persistence; by then, most people have lost interest in the threat. 
If journalists and policy makers do not understand the threat, then not much 
hope remains for the rest of the users. 

So our goal was to construct software that educates journalists and policy 
makers about user tracking. Proceeding under the assumption that most of them 
are comfortable with a Web browser, and that most probably use Microsoft In- 
ternet Explorer for Web browsing and Microsoft Outlook or Outlook Express for 
e-mail, we could not assume much about their technical sophistication. There- 
fore, ease of installation and ease of use were absolutely critical to a successful 
deployment. 

2.2 No Web Bug Blocking 

An important policy decision we faced was whether Bugnosis should “protect” 
users by blocking those Web bugs that it identified. Ultimately we decided
against it. Although implementation considerations also play into this, we eventually
realized that blocking Web bugs had absolutely nothing to do with our
education goal. In fact, it would have distracted from our goal. If we had added 
a blocking function, then Bugnosis would have been considered yet another tool 
for those relatively few people who are so concerned about user tracking that 
they are willing to invest the time to download, install, and maintain yet another 
little piece of optional software. But helping a few users block their Web bugs 
has no impact on the overall tracking regimen, and it only does a little for those 
users anyway. Basically, adding a blocking feature to Bugnosis would support 
the opt-out paradigm favored by those behind the tracking systems, precisely by 
being an opt-out mechanism. Without the blocking feature, Bugnosis remains 
agnostic on this point: it merely informs. 

2.3 Low Resistance to Countermeasures 

Web bugs are essentially covert channels. Therefore, it’s pointless to try too 
hard to spot them; an adversary that wants to evade Bugnosis can do so easily. 
Bugnosis simply provides a useful snapshot of current Web site practices. 

2.4 Expert Knowledge Database 

Although we believe Bugnosis does a good job at identifying Web bugs by 
their form alone, we recognized early on that an expert knowledge component 
would provide an important safety valve. Bugnosis therefore incorporates a small 
database of regular expressions and corresponding dispositions either forcing or 
preventing matching URLs from being identified as Web bugs. For example, one 
rule in the database forces the URL http://tnps.trivida.com/tnps/track.fgi to 
be identified as a Web bug independently of other automated considerations^. 
The database currently contains 23 such “positive” rules, all of which are derived 
from patterns described by Richard M. Smith [2] in his early investigations of 
Web bugs. In practice, these rules are seldom decisive; we found that 96% of 
the Web bugs that Bugnosis identified using these rules in a large sample would 
have been identified as Web bugs even if the rules had been missing [3]. 

The database currently contains only 6 “negative” rules preventing the identification
of Web bugs. Experience showed us that certain image forms
served by Yahoo and Akamai that had the formal characteristics of Web bugs 
were so common that Bugnosis alerted us constantly, even though our inspection 
of the relevant privacy policies suggested the risk was small. So these URLs are 
on our negative list. We have to cringe a bit when we manipulate the list, but 
we also believe that it is important for Bugnosis not to be too alarmist. We 
have always been more concerned about the possibility of false positives than about
overlooking some of the Web bugs.

^ TriVida Corp. was acquired by BeFree Inc. in March 2000; this URL appears to be
no longer in use, and so the rule is actually obsolete.

We included a mechanism to change the database over time. Bugnosis attempts
to acquire updated versions of the database every couple of weeks (with
the user’s permission). This became useful a few weeks after deployment when
we discovered an unfortunate interaction between Bugnosis and the WebWasher
[4] and Adextinguisher [5] privacy-enhancing tools. These tools block third party
images by rewriting their URLs in a way that makes them look like Web bugs
to Bugnosis. In the case of WebWasher, the rewritten URL actually appeared
to some users to be transmitting information to the WebWasher company, even
though a closer look made it clear that this was not true. In early December
2001, we also worked with Google personnel to exclude a set of URLs containing
raw IP addresses. Since these URLs were being used only to measure network
delay, Google could not change them into a more Bugnosis-friendly form without
degrading their measurements. We were able to handle all three of these cases
simply by adding negative rules in the database; no other software update was
necessary.

The database also contains “neutral” rules that simply associate URL pat- 
terns with Web site privacy policies and e-mail contact addresses. This is ex- 
plained further in section 4.2. Altogether, the Bugnosis database currently con- 
tains 45 entries. 
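
As an illustration of how such a rule database might be consulted (this is a sketch, not the Bugnosis source), the Python below models each rule as a regular expression plus a disposition. The `Rule` fields, the first-match-wins order, and every entry except the TriVida URL quoted above are assumptions made for the example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    pattern: str                      # regular expression matched against the image URL
    disposition: str                  # "positive", "negative", or "neutral"
    policy_url: Optional[str] = None  # hypothetical field: privacy policy for the site
    contact: Optional[str] = None     # hypothetical field: e-mail contact for the site


# Illustrative entries only; apart from the TriVida URL, these are made up.
RULES = [
    Rule(r"^http://tnps\.trivida\.com/tnps/track\.fgi", "positive"),
    Rule(r"^https?://spacer\.cdn\.example/", "negative"),
    Rule(r"^https?://([^/]+\.)?tracker\.example/", "neutral",
         policy_url="http://tracker.example/privacy",
         contact="privacy@tracker.example"),
]


def lookup(url, rules=RULES):
    """Return the first rule whose pattern matches url, or None (assumed ordering)."""
    for rule in rules:
        if re.search(rule.pattern, url):
            return rule
    return None


def disposition(url, rules=RULES):
    """Return the matching rule's disposition, or None when no rule applies."""
    rule = lookup(url, rules)
    return rule.disposition if rule else None
```

Because the rules are plain data, a list like this can be replaced without touching the matching code, which is the property the update mechanism described above relies on.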

3 Web Bugs Defined 

3.1 Current Definitions 

Definition 1 (vague). A Web bug is any HTML element that is (1) present
at least partially for surveillance purposes, and (2) is intended to go unnoticed 
by users. 

In order to automatically identify Web bugs, we had to refine this vague 
definition into something implementable. To get to the first specific Web bug 
definition, we define some properties that an HTML element may have. 

Property 1 (hostdiff). An element has this property if the host named in its URL 
is not exactly the same as the host named in the URL of the page containing 
the element. 

Property 2 (domaindiff). An element has this property if the host named in its 
URL is third party with respect to the URL of the page containing the element. 
Specifically, this means that their two highest DNS levels (i.e., rightmost two 
dot-separated components) differ. For example, www.bu.edu is third party to 
www.du.edu, because bu.edu ≠ du.edu. (Although this definition is useful in the
generic domains, it does not properly capture the notion of third party in some 
of the country code domains. We return to this issue in section 3.2.) 
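
Properties 1 and 2 can be sketched in a few lines of Python. The function names are ours, and the sketch assumes both URLs are absolute; it also reproduces the country-code weakness just mentioned.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Lower-cased host name of an absolute URL (empty string if absent)."""
    return (urlsplit(url).hostname or "").lower()


def top_levels(host, n):
    """The rightmost n dot-separated components of a host name."""
    return ".".join(host.split(".")[-n:])


def hostdiff(element_url, page_url):
    """Property 1: the two host names are not exactly the same."""
    return host_of(element_url) != host_of(page_url)


def domaindiff(element_url, page_url):
    """Property 2: the two highest DNS levels differ."""
    return top_levels(host_of(element_url), 2) != top_levels(host_of(page_url), 2)
```

With these definitions, `domaindiff("http://www.bu.edu/x.gif", "http://www.du.edu/")` holds, while two different hosts under `ac.uk` are wrongly treated as the same party.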

We are now able to present the two definitions used by other groups of 
researchers. 

Definition 2 (Cy-ID). A Cy-ID Web bug is an HTML element such that (1) 
the element is an image, (2) the image is 1x1 pixel large, and (3) the image has 
the domaindiff property. 
Definition 3 (SS). An SS Web bug is an HTML element such that (1) the
element is automatically loaded, and (2) the element has the hostdiff property. 

The Cy-ID definition describes the Web bugs identified in the Cyveillance 
study by Murray and Cowart [6] and those elements identified as “invisible ob- 
jects” in IDcide’s Site Analyzer product [7]. This definition captures the notion 
of “surreptitious” as meaning that the image contains only one pixel. The SS 
definition, used in the SecuritySpace report by Reinke [8], essentially tags ev- 
ery transaction involving a third party as a surveillance transaction. Since it 
uses hostdiff rather than domaindiff, it has a very suspicious notion of what a 
third party is: for instance, a page loaded from www.example.com that refer- 
ences an image from imgs.example.com would be designated a Web bug under 
this definition. 
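
To make the contrast between the two definitions concrete, here is a hedged Python sketch of Definitions 2 and 3. The `Element` record and its field names are hypothetical; only the three conditions of Cy-ID and the two conditions of SS come from the text.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class Element:
    url: str            # absolute URL of the element (assumed)
    is_image: bool
    auto_loaded: bool   # fetched without any user action
    width: int = 0
    height: int = 0


def _host(url):
    return (urlsplit(url).hostname or "").lower()


def _domain2(url):
    """Top two DNS levels, as used by the domaindiff property."""
    return ".".join(_host(url).split(".")[-2:])


def is_cyid_bug(e, page_url):
    """Definition 2: a 1x1 image with the domaindiff property."""
    return (e.is_image and e.width == 1 and e.height == 1
            and _domain2(e.url) != _domain2(page_url))


def is_ss_bug(e, page_url):
    """Definition 3: any automatically loaded element with the hostdiff property."""
    return e.auto_loaded and _host(e.url) != _host(page_url)
```

A 120x60 logo served from imgs.example.com on a www.example.com page is an SS Web bug under this sketch but not a Cy-ID Web bug, which is the over-inclusiveness described above.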

Stating the Bugnosis definition of Web bugs requires several more properties. 

Property 3 (tiny). An image is tiny if it is 7 square pixels or less. 

Property 4 (lengthy). Let imageURLs denote the list of all image URLs on
the page containing the image in question. An image is lengthy if either (1)
imageURLs contains one element and this image’s URL contains more than 100
characters, or (2) imageURLs contains more than one element and this image’s
URL length exceeds μ + 0.75σ characters, where μ and σ are the measured
mean and standard deviation of the imageURLs string sizes. An image with this
property appears to be communicating something unusual in its URL.
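
The lengthy test is simple arithmetic and can be sketched as follows. The text does not say whether the standard deviation is the population or the sample form; this sketch assumes the population form (`statistics.pstdev`), and the function name is ours.

```python
from statistics import mean, pstdev


def is_lengthy(url, image_urls):
    """Property 4: url is unusually long compared with the page's image URLs.

    image_urls is the list of all image URLs on the page, including url itself.
    """
    if len(image_urls) <= 1:
        # Case (1): a single image on the page, so use the fixed 100-character cutoff.
        return len(url) > 100
    lengths = [len(u) for u in image_urls]
    # Case (2): compare against mean + 0.75 standard deviations
    # (population standard deviation is an assumption of this sketch).
    return len(url) > mean(lengths) + 0.75 * pstdev(lengths)
```

For example, on a page with nine 27-character image URLs and one 217-character URL, the mean is 46 and the population standard deviation is 57, so the cutoff is 88.75 characters and only the long URL qualifies.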

Property 5 (once). An image has this property if its URL appears only once in 
the list of all image URLs on the page containing the image in question. An 
image that appears only once on a page is more likely to be a tracking device 
than an image that appears multiple times. 

Property 6 (protocols). An image has the protocols property if its URL contains 
more than one substring from the set ‘http:’, ‘https:’, ‘ftp:’, ‘file:’. For example, 
an image with the URL http://track.example.com/log/ftp://www.source.com
would have this property. This property indicates that its URL may contain 
some tracking information. 
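
Properties 5 and 6 are both string checks, sketched below in Python. The phrase “more than one substring” is read here as more than one occurrence in total (so a URL embedding a second `http:` also counts); that reading, and the case-insensitive match, are assumptions of the sketch.

```python
SCHEMES = ("http:", "https:", "ftp:", "file:")


def once(url, image_urls):
    """Property 5: url appears exactly once among the page's image URLs."""
    return image_urls.count(url) == 1


def protocols(url):
    """Property 6: url contains more than one scheme-like substring."""
    lowered = url.lower()
    # "http:" is not a substring of "https:", so the two are counted separately.
    return sum(lowered.count(s) for s in SCHEMES) > 1
```

The example URL from the text contains one `http:` and one `ftp:`, so `protocols` returns true for it.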

Property 7 (tpcookie). An image has the tpcookie (“third party cookie”) property
if it has the domaindiff property and the Web browser has a cookie stored
for the image’s domain. 

Property 8 (positive). An image has this property if its URL matches an entry 
in the “positive” database (see section 2.4). 

Property 9 (negative). An image has this property if its URL matches an entry
in the “negative” database (see section 2.4). 

Given these extra properties, we can now define the Bugnosis notion of a 
Web bug. 
Definition 4 (Bugnosis). An HTML element is a Bugnosis Web bug if (1)
the element is an image, and either (2) it has the positive property, or (3) it does
not have the negative property but does have the tiny and domaindiff properties and at least
one additional property other than hostdiff.

Basically, in order to consider an image a Web bug, Bugnosis requires it to 
be surreptitious, third-party, and to carry some additional evidence that it is 
unusual. In practice this means that Bugnosis does not consider tiny images 
used for spacing purposes as Web bugs, nor does it automatically consider all 
third party content to be Web bugs. 
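
Definition 4 reduces to a small Boolean predicate once the properties have been computed. The sketch below assumes the element is already known to be an image and takes “additional property” to mean one of lengthy, once, protocols, or tpcookie; the `ImageProps` record is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ImageProps:
    """Precomputed properties of one image (field names follow the text)."""
    tiny: bool = False
    domaindiff: bool = False
    lengthy: bool = False
    once: bool = False
    protocols: bool = False
    tpcookie: bool = False
    positive: bool = False
    negative: bool = False


def is_bugnosis_bug(p):
    """Definition 4, for an element already known to be an image."""
    if p.positive:
        # Clause (2): a positive rule decides the matter, even if a negative rule also matches.
        return True
    if p.negative:
        return False
    # Clause (3): tiny, third party, and at least one further sign that it is unusual.
    extra = p.lengthy or p.once or p.protocols or p.tpcookie
    return p.tiny and p.domaindiff and extra
```

A tiny third party image with no further evidence, such as a spacer repeated many times on the page, is therefore not reported.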

3.2 Discussion 

Although it is certainly possible to obtain perfectly satisfactory surveillance 
capabilities by using HTML elements other than images, there is no way to 
distinguish between, say, a JavaScript program that is fetched only in order to 
trigger a log entry, and one that actually has some “meaningful” content. So 
Bugnosis only examines images, where it has a hope of telling the difference. 

There are only two ways for an image to be considered a Bugnosis Web bug
but not a Cy-ID or SS Web bug: (1) if it matches our “positive” database, or
(2) if it is just a few pixels too big to be a Cy-ID Web bug. The former case
doesn’t introduce any error, because our database is built to be consistent with
our primary definition 1. The latter case is a possible source of error: we chose
to treat images of 7 square pixels or smaller as tiny by just gazing at the
screen and determining that images that size were not terribly useful for graphic 
content, even though they certainly can be visible. Although we haven’t observed 
any problems with our definition — we have indeed seen Bugnosis accurately 
identify some 1x2 and 2x2 images as Web bugs — 7 square pixels may have been 
overkill. 

We believe the Bugnosis definition of Web bugs is as strong as any other 
Web bug definition in use. Even so, subsequent Bugnosis versions will make it 
tighter: 

Definition 5 (NG). An HTML element is a (next generation) Bugnosis Web
bug if (1) the element is an image, and either (2) it has the positive property, or
(3) it does not have the negative property but does have the tiny, strong-domaindiff,
tpcookie, and once properties.

Note that this definition says nothing about the “lengthy” and “protocols” 
properties. These properties are no longer central to the Web bug identification 
process; however, once a Web bug has been found, they are used to sort the Web 
bugs by severity. 
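
A sketch of Definition 5 and the severity ordering follows. The text says only that lengthy and protocols are used to sort Web bugs by severity; the additive score below is a placeholder assumption, as are the `Img` record and the function names.

```python
from dataclasses import dataclass


@dataclass
class Img:
    url: str
    tiny: bool = False
    strong_domaindiff: bool = False
    tpcookie: bool = False
    once: bool = False
    lengthy: bool = False
    protocols: bool = False
    positive: bool = False
    negative: bool = False


def is_ng_bug(i):
    """Definition 5, for an element already known to be an image."""
    if i.positive:
        return True
    return (not i.negative and i.tiny and i.strong_domaindiff
            and i.tpcookie and i.once)


def severity(i):
    """Hypothetical score: one point each for lengthy and protocols."""
    return int(i.lengthy) + int(i.protocols)


def ranked_bugs(images):
    """Web bugs among images, most severe first."""
    return sorted((i for i in images if is_ng_bug(i)), key=severity, reverse=True)
```

Unlike Definition 4, every one of the four listed properties is required here, so an image without a third party cookie is never reported unless a positive rule matches.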

We used this tightened definition in our survey of Web bug use [3]. It includes
three improvements over the definition used in the currently available Bugnosis 
release. First, we decided that third party sites that did not use cookies were 
unlikely to be tracking users individually, since they would have no practical 
way to distinguish users. (We are not aware of any non-cookie identification 
techniques in common practice.) On the other hand, although the presence of a
third party cookie may indicate individual tracking, it may also be benign, or 
even privacy enhancing: a cookie that says “TextOnly=true” is a handy way to 
record a user preference without requiring a full blown registration procedure. 
Still, since definition 4 also checked for third party cookies, this addition only 
causes us to identify fewer Web bugs than before. 

The second improvement is the requirement of the “once” property, rather 
than simply taking “once” as some positive evidence of a Web bug. An image 
whose URL occurs multiple times on the page (and therefore not “once”) is 
clearly not a Web bug: the second URL request would probably not even make 
it out of the browser’s cache, and even if it did, the origin server would end 
up with two practically identical log entries. No Web designer would naturally 
repeat a URL as part of a tracking scheme. 

Finally, the third change improves our ability to see Web bugs in a wider part 
of the DNS space. As noted previously, the “domaindiff” notion of “third party” 
only examined the top two layers of the hosts’ DNS names. By this definition, the 
University of Cambridge Web site www.cam.ac.uk and the University of Oxford 
site www.ox.ac.uk seem to be in the same domain, which is clearly absurd. We 
address this with the “strong-domaindiff” property and an auxiliary function 
called reach[] that maps domain name suffixes (com, net, etc.) to domain name
levels: 

Property 10 (strong-domaindiff). Suppose an element E from the host named 
H is embedded in a page served from an origin server named O. Let O’ be the 
longest suffix of O such that reach[O’] is defined, and let n = reach[O’], or n = 3 if
no such O’ exists. Then E has the strong-domaindiff property if the top n DNS 
levels of H and O differ.

The reach[] function is bundled with our expert knowledge database (section
2.4). For example, it specifies reach[uk]=3, reach[com]=2, reach[net]=2, and so
on. Thus we are now able to distinguish Cambridge from Oxford, as well as 
the New York Times from DoubleClick. This was the cleanest solution we could 
find for the domain test that did not exclude large swaths of the DNS space. 
The disadvantage is that by maintaining this mapping ourselves, we run the 
risk of making mistakes. Fortunately, since the function is distributed with our 
expert knowledge database, it too can be updated remotely without reinstalling 
Bugnosis. 
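
The strong-domaindiff test can be sketched as below. The three table entries are the ones quoted in the text; the rest of the mapping is not given, so unknown suffixes fall back to n = 3 as in Property 10. The sketch treats an element as third party when the top n DNS levels of the two hosts are not the same, in line with Property 2 and the Cambridge/Oxford example. Function names are ours.

```python
# Only these three values are stated in the text; the real table is larger.
REACH = {"com": 2, "net": 2, "uk": 3}
DEFAULT_LEVELS = 3


def levels_for(origin_host, reach=REACH):
    """n = reach[O'] for the longest suffix O' of the origin host in the table, else 3."""
    labels = origin_host.lower().split(".")
    for i in range(len(labels)):          # longest suffix first
        suffix = ".".join(labels[i:])
        if suffix in reach:
            return reach[suffix]
    return DEFAULT_LEVELS


def top_levels(host, n):
    """The rightmost n dot-separated components of a host name."""
    return ".".join(host.lower().split(".")[-n:])


def strong_domaindiff(element_host, origin_host, reach=REACH):
    """Property 10: the top n DNS levels of the two hosts are not the same."""
    n = levels_for(origin_host, reach)
    return top_levels(element_host, n) != top_levels(origin_host, n)
```

With this table, www.ox.ac.uk is third party to www.cam.ac.uk because three levels are compared, while imgs.example.com and www.example.com still count as the same party.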

We also considered but decided against using the definition of “third party” 
from RFC 2965 [9], the latest HTTP cookie specification. The issue is that RFC
2965 considers the host “www.example.com” to be third party to the origin server 
“www.sales.example.com”. While this makes some sense within the DNS zone 
delegation model, far too many sites use a single authority for everything under 
their primary domain for this definition to work in Bugnosis. The average Web 
user would only be confused by the claim that these two sites are substantially 
different. 

RFC 2965 also exhibits some degenerate behavior with short domain names. 
For example, “images.example.com” is considered third party to the origin server 
“example.com”, even though “images.example.com” is not considered third party
to “www.example.com”. 
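
The behavior described in the last two paragraphs can be reproduced with a simplified reading of the RFC 2965 “reach” and “domain-match” rules. This sketch ignores IP addresses, ports, and other details of the specification, so it should be taken as an approximation for illustration rather than a conforming implementation.

```python
def reach(host):
    """Reach of host H: '.B' if H = A.B, A has no dots, and B contains a dot
    (or is 'local'); otherwise H itself. Simplified from RFC 2965."""
    host = host.lower()
    first, _, rest = host.partition(".")
    if rest and ("." in rest or rest == "local"):
        return "." + rest
    return host


def domain_matches(a, b):
    """a domain-matches b if they are equal, or b has the form '.B' and a = N + b
    for some non-empty N."""
    a, b = a.lower(), b.lower()
    if a == b:
        return True
    return b.startswith(".") and a.endswith(b) and len(a) > len(b)


def is_third_party(request_host, origin_host):
    """Third party if the request host does not domain-match the origin's reach."""
    return not domain_matches(request_host, reach(origin_host))
```

Under this reading, the reach of “example.com” is the host name itself, which is why “images.example.com” fails to domain-match it, whereas the reach of “www.example.com” is “.example.com”, which “images.example.com” does match.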

4 Implementation Issues 

4.1 Data Transport to and from Bugnosis 

We considered using a proxy approach in order to discover the images on a page,
but our previous experience with proxy-based content understanding [10] was 
not encouraging. It is very hard to model a Web browser without writing a Web 
browser. Besides, a central purpose of SSL is to prevent third party snooping, 
so proxies cannot easily inspect the content of HTTPS pages. (It is possible, 
though: a trusted proxy can install itself as a root certification authority and 
use a man-in-the-middle approach to obtain the cleartext [11].) 

The Document Object Model (DOM) [12] is well supported by Internet Ex- 
plorer and provides the hooks we needed. Instead of parsing HTML, Bugnosis 
asks IE for a DOM representation of the current Web page and then traverses it 
looking for images and their characteristics. In order to learn about changes in 
the current document, Bugnosis uses IE’s Browser Helper Object [13] functionality
to sink relevant events. So Bugnosis sees everything that IE does, even the pages 
delivered by SSL and those that load images using JavaScript long after the main 
page has been rendered. In addition, Bugnosis sees the image dimensions as they 
appear on the page, not as they are recorded in the delivered file. This is appropri- 
ate because the “tiny” property has to do with an image’s perceptibility, not its 
binary content. A 10x10 image file can be made tiny on the screen by changing 
its size with HTML attributes: <IMG SRC="http://example.com/10x10.gif"
HEIGHT="1" WIDTH="1">. (It is standard practice among Web designers to specify
the height and width attributes with images; without them, the browser’s
HTML layout engine has to continually redraw the page as images arrive and it 
figures out how big they are.) 
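
The distinction between file dimensions and displayed dimensions can be illustrated with a small Python sketch. Bugnosis itself obtains the rendered size from the IE DOM; the attribute dictionary below is only a stand-in for that, it assumes integer pixel values (no percentages), and it reads “7 square pixels or less” as a displayed area of at most 7 pixels.

```python
TINY_MAX_AREA = 7  # Property 3, read as displayed width * height <= 7 (assumption)


def displayed_size(file_size, attrs):
    """Size on screen: WIDTH/HEIGHT attributes, when present, override the file's size.

    file_size is (width, height) of the image file; attrs maps lower-cased
    attribute names to their string values.
    """
    width, height = file_size
    width = int(attrs.get("width", width))
    height = int(attrs.get("height", height))
    return width, height


def is_tiny(file_size, attrs):
    """Property 3 applied to the displayed size, not the binary content."""
    w, h = displayed_size(file_size, attrs)
    return w * h <= TINY_MAX_AREA
```

So a 10x10 file shown with `HEIGHT="1" WIDTH="1"` is tiny, while the same file shown at its natural size is not.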

Having obtained the list of images and their attributes, Bugnosis analyzes 
all of the images on all of the frames of the Web page and creates an XML 
representation of the analysis. This XML can be stored for subsequent use (as 
we did in [3]) or converted immediately for display. 

To display its results, Bugnosis creates its own secondary Web browser and 
embeds it in an Explorer “Comm Band” — the lower pane in the IE window. 
Bugnosis creates HTML directly (i.e., as text) and writes it into this object to 
display results. Sticking with HTML is a big advantage here, because it provides 
a familiar interface with hypertext support that can be easily printed, copied to 
another window, or e-mailed. 

4.2 Making Web Bugs Apparent 

Once Bugnosis has identified a Web bug, it sounds an alert and makes its analysis
pane visible. In addition, it replaces the image URL in the DOM with a small
image of a bug: this immediately causes the bug to become visible in the main
IE window. By keeping the visible image small, Bugnosis attempts to leave the
page layout mostly intact. Making the bug visible is helpful because Web bugs
are often adjacent to other content that may provide a clue as to their purpose.
In Figure 1, for instance, the bug is visible next to a list of retailers that support
the site being visited; the Web bug probably has something to do with them.
Bugnosis also populates the image’s ALT attribute with summary information about
the bug.

Fig. 2. An e-mail message composed by Bugnosis when the user clicked on the e-mail
icon.

The analysis pane shows the properties supporting Bugnosis’ claims about 
the image; in Figure 1, we see the tiny, once, domaindiff, and tpcookie properties 
along with the bug’s cookie value. The phrase “recognized site” means that the 
site is in the database of known sites. 

The Bugnosis database also associates privacy policy URLs and e-mail ad- 
dresses with the sites it recognizes. Clicking on the home icon (when present) 
navigates a new IE window to the privacy policy page for the Web bug provider. 
Clicking on the e-mail icon (when present) launches the user’s e-mail program 
with a draft message addressed to the Web bug provider, which the user can 
edit and send if desired. 

This e-mail composition (see Figure 2) is by far the most “activist” feature 
in Bugnosis. Obviously, we felt that Web bug disclosure practices were generally 
inadequate when we designed this feature. Web bug disclosures may have im- 
proved since then [14], but even as late as October 2001, we found that 29% of 
popular Web sites that contain Web bugs in a U.S. Federal Trade Commission 
sample say nothing that even hints at the possibility or implications of third 
party content [3]. 
4.3 Technical Challenges

Object Soup. We were not terribly familiar with COM, ATL, or ActiveX when 
we began this project, and it turned out to be an excellent, if harrowing, learning 
opportunity [15, 13]. Bugnosis is packaged as a dynamically linked library (DLL) 
file that exposes three main COM objects: a tool bar for controlling Bugnosis, 
the Browser Helper Object for monitoring user-initiated browsing events, and a 
Comm Band for displaying Bugnosis output. Installation merely adds references 
to these objects to the system registry. The Bugnosis database is maintained 
separately in an XML file. 

When IE is invoked, it consults the registry and creates these three Bugnosis 
objects in an apparently unpredictable order, and without any reference to each 
other. Our first challenge was to make these objects discover each other so they 
can then share data about the current session. Ultimately we resorted to a global 
map (shared by every process and thread that loads the Bugnosis DLL) in order 
to cope with this. During a beta test period we also discovered that some versions 
of IE for NT maintain a cache that needed to be flushed at installation time; 
otherwise, IE would not create all three of the objects. 

Once the Bugnosis objects have located each other, they create several addi- 
tional objects: a subordinate Web browser object for displaying Bugnosis output, 
an XML parser, an auxiliary event sink, and other minor objects. Many of these 
objects contain references to each other, and this prevents COM’s reference 
counting architecture from freeing inaccessible resources at shutdown time with- 
out programmer assistance. Indeed, this was still broken in Bugnosis 1.1; each 
new IE window leaks some memory until all of the IE threads in that process 
exit. 

Special UI Behavior in the Bugnosis Analysis Pane. Not all of the func- 
tionality we desired in the Bugnosis output pane was achievable through HTML 
and scripting alone. In particular, the standard mailto: protocol does not pro- 
vide a way for a script to include an attachment in an e-mail message, but we 
felt it was important to include the Bugnosis output when a user sent an e-mail 
inquiry to a Web bug provider. We also wanted the main IE status line to display 
appropriate information when the user moved the pointer over links within the 
Bugnosis analysis window. (The main IE window knows only that the Bugnosis 
DLL has part of the screen real estate down there — not that Bugnosis is using 
it to display HTML.) Both issues were solved by attaching an object to the lower 
pane to sink its events and handling or propagating them appropriately. 

We attempted to use IE’s “DHTML Behaviors” [13] to automate some of 
the UI. Behaviors are programmatic content that can be associated with HTML 
elements in the same way that CSS assigns style information to HTML elements. 
For instance, we wanted to make the pointer icon and the status line change when 
the pointer crossed into the e-mail icon so the user would recognize that it is 
clickable. In addition, we wanted hyperlinks (such as http://service.bfast.com in 
Figure 1) to invoke a script when clicked, but we wanted them to otherwise look 
and behave like ordinary links. After inventing a new style for these behaviors 
and assigning the style to the appropriate objects, we were plagued with 
unpredictable crashes. Ultimately we stopped using DHTML Behaviors altogether. 
Instead, one of the DLL objects iterates through the elements and reassigns 
event handlers whenever the output window changes content. 

Finally, we felt it was important to be able to send Bugnosis analyses to 
computers that did not have Bugnosis installed. Bugnosis therefore removes 
all of the IE-specific material when preparing its analysis for export (such as 
printing, saving, copying to the clipboard, etc.). 

4.4 Installation and Uninstallation 

Since Bugnosis is a collection of COM objects, we thought we would take the 
ActiveX plunge and use a Web-based installation method. The idea is to provide 
an installation page that includes an <OBJECT> tag. IE will recognize that this 
tag refers to an ActiveX object that must be fetched and instantiated. Under 
default settings, this will cause a familiar security dialog to appear asking if the 
user trusts the software vendor before proceeding. Once the object is instantiated, 
IE will attempt to display it on the Web page containing the <OBJECT> tag. By then, 
Bugnosis has modified the registry for future IE sessions. 

As part of this process it is important to give the user feedback. If anything 
goes wrong during the ActiveX registration and instantiation, this needs to be 
explicitly indicated; however, we are restricted to standard scripting techniques 
for this, since Bugnosis is not yet installed. Our approach is to assume that in- 
stallation always fails. After the installation page is finished loading, IE switches 
to a troubleshooting page — unless a certain JavaScript global variable has been 
set. Meanwhile, whenever Bugnosis senses that the Bugnosis Web site is being 
visited, it injects a script into the page that sets this variable. So if the Bugno- 
sis installation actually succeeded, it changes the default course of action and 
instead leads IE to an “installation successful” page. 

When the user asks Bugnosis to uninstall itself, it simply disconnects itself 
from the registry so it will no longer be invoked in later IE sessions. It does not 
scrape the DLL off the hard drive. 

This deployment strategy has not worked as well as we had hoped. Some 
users expecting the standard download/save/install procedure have complained 
about unexpected installations even after clicking “install” in response to the 
explicit security warning. Others felt that the presence of the DLL on the disk 
after uninstallation was unsafe and requested a way to remove it. But describing 
how to remove this file is complicated because the precise directory that houses 
downloaded ActiveX objects is not consistent across all Windows installations. 
Even worse, the directory has unusual display semantics in Windows Explorer, 
so it does not show up as expected on the desktop. Even the standard “find 
file” dialog does not locate it. Luckily, command-line sessions are able to find 
and manipulate the file. Finally, the installation method just does not work on 
some systems. We have received many reports of users who saw the “installation 
failed” page but then later found that Bugnosis was actually working. We clearly 
have some more work to do here. 

4.5 User Community Size 

Bugnosis was built by two people with only part-time attention over a period 
of several months. In the 4.5 months following the initial announcement, over 
100,000 users have downloaded and installed Bugnosis. Our Web hosting service 
quickly upgraded the site to a heavier-duty machine to deal with the initial burst 
of interest at release time, but we have no good system to deal with the 1,200 
e-mail messages we have received so far. Fortunately, much of the e-mail either 
does not require a response or is covered in the Bugnosis FAQ. Unfortunately, 
much of the rest simply does not get answered. 

5 Plans 

5.1 Web Bugs in E-mail 

Web bugs are not limited to Web pages. Users equipped with an HTML-enabled 
mail reader can also be tracked when they read e-mail. When the user views a 
bugged message, the reader will fetch all of the required images, thereby inform- 
ing a third party of the reading action, and possibly sending an identifying cookie 
in the process. Web services exist that automate the construction of Web-bugged 
e-mails [16, 17]. 

In order to prepare and send a bugged message, the sender must already 
have the recipient’s e-mail address. This bit of personal information is generally 
not available to Web sites that place Web bugs. Since Web bugs in e-mail are 
usually preserved when the e-mail is forwarded, a tracker may be able to learn 
who a target’s associates are. And with a little JavaScript, comments added to 
a message when it is forwarded may even be intercepted by the tracker under 
some circumstances [18]. For all of these reasons, the practice of bugging e-mails 
is seen as more intrusive than bugging Web pages. We hope to include an e-mail 
Web bug detector in a future release. 
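
As a rough illustration of what such a detector might check, the sketch below scans the HTML body of a message for images that are both tiny and served from a host other than the sender's. Both heuristics, the function names, and the example domains are assumptions made for this sketch; they are not the rules Bugnosis applies to Web pages, and a real detector would need to handle many more cases (CSS-sized images, scripts, redirects).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageCollector(HTMLParser):
    """Collects the attributes of every <img> tag in a document."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.images.append(dict(attrs))


def suspicious_images(html, sender_domain):
    """Return the src of each tiny image fetched from a third-party host."""
    parser = ImageCollector()
    parser.feed(html)
    hits = []
    for img in parser.images:
        src = img.get("src") or ""
        host = urlparse(src).hostname or ""
        # Heuristic 1 (assumed): declared size of at most one pixel each way.
        tiny = img.get("width") in ("0", "1") and img.get("height") in ("0", "1")
        # Heuristic 2 (assumed): host is neither the sender nor a subdomain of it.
        same_site = host == sender_domain or host.endswith("." + sender_domain)
        if tiny and host and not same_site:
            hits.append(src)
    return hits
```

For example, a message from `shop.example` that embeds a one-pixel image from `tracker.example` alongside its own logo would have only the former reported.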

5.2 Platform for Privacy Preferences Project 

Our ability to help the user contact those responsible for the Web bugs is severely 
limited by our contact database. Without a contact entry, Bugnosis offers no 
assistance and simply omits the Web page and e-mail icons in its analysis. Fur- 
thermore, our database only contains entries about Web bug providers, i.e., the 
third parties named in the Web bug URL. Thus in the example of Figure 1, 
Bugnosis provides contact information for “bfast.com”, but offers no assistance 
for contacting “photo.net”, even though both sites are necessarily involved in 
the decision to place the Web bug. 

The obvious solution is to use P3P policies [19] to search for both types of 
contact information. This would allow us to eliminate most of the “neutral” Bug- 
nosis database entries. In addition, we could allow Bugnosis to suppress warnings 
about Web bugs that have acceptable disclosures. This type of functionality is 
a high priority for future releases. 

Abstract. Frequently, communication between two principals reveals 
their identities and presence to third parties. These privacy breaches 
can occur even if security protocols are in use; indeed, they may even be 
caused by security protocols. However, with some care, security protocols 
can provide authentication for principals that wish to communicate while 
protecting them from monitoring by third parties. This paper discusses 
the problem of private authentication and presents two protocols for 
private authentication of mobile principals. In particular, our protocols 
allow two mobile principals to communicate when they meet at a location 
if they wish to do so, without the danger of tracking by third parties. 
The protocols do not make the (dubious) assumption that the principals 
share a long-term secret or that they get help from an infrastructure of 
ubiquitous on-line authorities. 



1 Privacy, Authenticity, and Mobility 

Although privacy may coexist with communication, it often does not, and there 
is an intrinsic tension between them. Often, effective communication between 
two principals requires that they reveal their identities to each other. Still, they 
may wish to reveal nothing to others. Third parties should not be able to infer 
the identities of the two principals, and to monitor their movements and their 
communication patterns. For better or for worse, they often can. In particular, 
a mobile principal may advertise its presence at a location in order to discover 
and to communicate with certain other principals at the location, thus revealing 
its presence also to third parties. 

Authentication protocols may help in addressing these privacy breaches, as 
follows. When a principal A wishes to communicate with a principal B, and is 
willing to disclose its identity and presence to B but not to other principals, A 
might demand that B prove its identity before revealing anything. An authen- 
tication protocol can provide this proof. It can also serve to establish a secure 
channel for subsequent communication between A and B. 

However, authentication protocols are not an immediate solution, and they 
can in fact be part of the problem. Privacy is not one of the explicit goals of 
common authentication protocols. These protocols often send names and 
credentials in cleartext, allowing any eavesdropper to see them. An eavesdropper 
may also learn substantial information from encrypted packets, even without 
knowing the corresponding decryption keys; for example, the packets may con- 
tain key identifiers that link them to other packets and to certain principals. 
Furthermore, in the course of authentication, a principal may reveal its identity 
to its interlocutor before knowing the interlocutor’s identity with certainty. If A 
and B wish to communicate but each wants to protect its identity from third 
parties, who should reveal and prove theirs first? 

This last difficulty is more significant in peer-to-peer communication than 
in client-server communication, although the desire for privacy appears in both 
settings. 

— In client-server systems, the identity of servers is seldom protected. How- 
ever, the identity of clients is not too hard to protect, and this is often 
deemed worthwhile. For example, in the SSL protocol [14], a client can first 
establish an “anonymous” connection, then authenticate with the protection 
of this connection, communicating its identity only in encrypted form. An 
eavesdropper can still obtain some addressing information, but this infor- 
mation may be of limited value if the client resides behind a firewall and 
a proxy. (Similarly, the Skeme protocol [19] provides support for protecting 
the identity of the initiator A of a protocol session, but not the identity of 
the interlocutor B.) 

— The symmetry of peer-to-peer communication makes it less plausible that 
one of the parties in an exchange would be willing to volunteer its identity 
first. Privacy may nevertheless be attractive. In particular, mobile principals 
may want to communicate with nearby peers without allowing others to 
monitor them (cf. Bluetooth [7] and its weaknesses [18]). Thus, privacy seems 
more problematic and potentially more interesting in the fluid setting of 
mobile, peer-to-peer communication. 

This paper gives a definition of a privacy property (informally). This property 
implies that each principal may reveal and prove its identity to certain other 
principals, and hide it from the rest. The definition applies even if all parties are 
peers and have such privacy requirements. 

Standard authentication protocols do not satisfy the privacy property. How- 
ever, we show two protocols that do, and undoubtedly there are others (to the 
extent that informally described protocols can satisfy informally defined prop- 
erties). In our protocols, a session between two principals A and B consists of 
messages encrypted under public keys and under session keys in such a way that 
only A and B discover each other’s identity. The protocols differ from standard 
protocols by the absence of cleartext identity information. More subtly, they 
rely on some mild but non-trivial assumptions on the underlying cryptographic 
primitives. One of the protocols also includes a subtle “decoy” message in order 
to thwart certain active attacks. 

Our protocols do not assume that the principals A and B have a long-term 
shared secret. Neither do they require an infrastructure of on-line trusted third 
parties, or suppose that the world is organized into domains and that each 
principal has a home domain. In this respect, the protocols contrast with previous ones 
for related purposes (see for example [4, 6, 23, 30] and section 5). Because of their 
weak infrastructure needs, the protocols are consistent with ad hoc networking. 

As an example, consider a mobile principal A that communicates with others 
when they are in the same (physical or virtual) location. In order to establish 
connections, A might constantly broadcast “hello, I am A, does anyone want to 
talk?”. An eavesdropper could then detect A’s presence at a particular location. 
An eavesdropper could even monitor A’s movements without much difficulty, 
given sensors at sufficiently many locations. Our protocols are applicable in this 
scenario, and are in fact designed with this scenario in mind. Suppose that two 
principals A and B arrive anonymously at a location. Although A and B may 
know of each other in advance, they need not have a long-term shared key. 
Furthermore, neither may be certain a priori that the other one is present at 
this location. If they wish to communicate with one another, our protocols will 
enable them to do it, without the danger of being monitored by others. 

The next section defines and discusses the privacy property sketched above. 
Section 3 presents the assumptions on which our protocols rely. Section 4 devel- 
ops the two protocols and some optimizations and extensions. Section 5 discusses 
some related problems and related work (including, in particular, work on mes- 
sage untraceability). Section 6 concludes. 

This paper does not include a formal analysis for the protocols presented. 
However, formalizing the protocols is mostly a routine exercise (for example, 
using the spi calculus [1] or the inductive method [25]). Reasoning about their 
authenticity and secrecy properties, although harder, is also fairly routine by 
now. More challenging is defining a compelling formal specification of the privacy 
property. Such a specification should account for any “out-of-band” knowledge 
of attackers, of the kind discussed in section 3. In this respect, placing private au- 
thentication in the concrete context of a system may be helpful. We regard these 
as interesting subjects for further work. Recently, several researchers who read 
drafts of this paper (Vitaly Shmatikov and Dominic Hughes, Hubert Comon and 
Veronique Cortier, and Cedric Fournet) have made progress on these subjects. 
Their ideas should be applicable to other systems with privacy goals, beyond 
the protocols of this paper. 

2 The Problem 

More specifically, suppose that a principal A is willing to engage in communi- 
cation with some set of other principals Sa (which may change over time), and 
that A is willing to reveal and even prove its identity to these principals. This 
proof may be required, for instance if A wishes to make a sensitive request from 
each of these principals, or if these principals would reveal some sensitive data 
only to A. The problem is to enable A to authenticate to principals in Sa 
without requiring A to compromise its privacy by revealing its identity or Sa more 
broadly: 

1. A should be able to prove its identity to principals in Sa, and to establish 
authenticated and private communication channels with them. 

2. A should not have to indicate its identity (and presence) to any principal 
outside Sa. 

3. Although an individual principal may deduce whether it is in Sa from A’s 
willingness to communicate, A should not have to reveal anything more 
about Sa. 

Goal 1 is common; many cryptographic protocols and security infrastructures 
have been designed with this goal in mind. 

Goal 2 is less common. As discussed above, it is seldom met with standard 
protocols, but it seems attractive. When C is a principal outside Sa, this goal 
implies that A should not have to prove its identity to C, but it also means that 
A should not have to give substantial hints of its identity to C. 

We could consider strengthening goal 2 by saying that A should have to 
reveal its identity only to principals B ∈ Sa such that A ∈ Sb, in other words, 
to principals with which A can actually communicate. However, we take the view 
that Sb is under B’s control, so B could let A ∈ Sb, or pretend that this is the 
case, in order to learn A’s identity. At any rate, this variant seems achievable, 
with some additional cost; it may deserve study. 

Goal 3 concerns a further privacy guarantee. Like goal 2, it is somewhat 
unusual, seldom met with standard techniques, but attractive from a privacy 
perspective. It might be relaxed slightly, in particular allowing A to reveal the 
approximate size of Sa. 

Note that A may be willing to engage in anonymous communication with 
some set of principals in addition to Sa. We assume that A is programmed 
and configured so that it does not spuriously reveal its identity (or other pri- 
vate data) to those other principals accidentally. This assumption is non-trivial: 
in actual systems, principals may well reveal and even broadcast their names 
unnecessarily. 

3 Assumptions 

This section introduces the assumptions on which our protocols rely. They gen- 
erally concern communication and cryptography, and the power of the adversary 
in these respects. (Menezes et al. [22] give the necessary background in cryptog- 
raphy; we rely only on elementary concepts.) Although the assumptions may not 
hold in many real systems, they are realistic enough to be implementable, and 
advantageously simple. 



3.1 Communication 

We assume that messages do not automatically reveal the identity of their 
senders and receivers, for example, by mentioning them in headers. When the 
location of the sender of a message can be obtained, for example, by triangulation, 
this assumption implies that the location does not reveal the sender’s identity. 
This assumption also entails some difficulties in routing messages. Techniques for 
message untraceability (see for example [10, 26, 27] and section 5) suggest some 
sophisticated solutions. Focusing on a relatively simple but important case, we 
envision that all messages are broadcast within some small area, such as a room 
or a building. 

We aim to protect against an adversary that can intercept any message sent 
on a public channel (within the small area under consideration or elsewhere). In 
addition, the adversary is active: it can send any message that it can compute. 
Thus, the adversary is essentially the standard adversary for security protocols, 
as described, for example, by Needham and Schroeder [24]. 

We pretend that the adversary has no “out-of-band” information about the 
principals with which it interacts, that is, no information beyond that provided 
by the protocols themselves. This pretense is somewhat unrealistic, but it is a 
convenient simplification, as the following scenario illustrates. Suppose that three 
principals A, B, and C are at the same location, alone. Suppose further that A 
and B are willing to communicate with one another, and that A is willing to 
communicate with C, but B is not. Therefore, the presence of B should remain 
hidden from C. However, suppose that C suspects that Sa = {B, C}, or that 
A even tells C that Sa = {B, C}. When C sees traffic between A and someone 
else, C may correctly deduce that B is present. Our simplification excludes this 
troublesome but artificial scenario. 



3.2 Cryptography 

We also assume that each principal A has a public key Ka and a corresponding 
private key Ka^-1, and that the association between principals and public keys is 
known. This association can be implemented with the help of a mostly-off-line 
certification authority. In this case, some additional care is required: fetching 
certificates and other interactions with the certification authority should not 
compromise privacy goals. Alternatively, the association is trivial if we name 
principals by their public keys, for example as in SPKI [12]. Similarly, it is also 
trivial if we use ordinary principal names as public keys, with an identity-based 
cryptosystem [31]. Therefore, we may basically treat public keys as principal 
names. 

When K^-1 is a private key, we write {M}_{K^-1} for M signed using K^-1, in 
such a way that M can be extracted from {M}_{K^-1} and the signature verified 
using the corresponding public key K. As usual, we assume that signatures 
are unforgeable. Similarly¹, when K is a public key, we write {M}_K for the 
encryption of M using K. We expect some properties of the encryption scheme: 

¹ These notations are concise and fairly memorable, but perhaps somewhat misleading. 
In particular, they imply that the same key pair is used for both public-key signatures 
and encryptions, and that the underlying algorithms are similar for both kinds of 
operations (as in the RSA cryptosystem). We do not need to assume these properties. 

1. Only a principal that knows the corresponding private key K^-1 should be 
able to understand a message encrypted under a public key K. 

2. Furthermore, decrypting a message with a private key K^-1 should succeed 
only if the message was encrypted under the corresponding public key K, 
and the success or failure of a decryption should be evident to the principal 
who performs it. 

3. Finally, encryption should be which-key concealing [3, 5, 8], in the following 
sense. Someone who sees a message encrypted under a public key K should 
not be able to tell that it is under K without knowledge of the corresponding 
private key K^-1, even with knowledge of K or other messages under K. 
Similarly, someone who sees several messages encrypted under a public key 
K should not be able to tell that they are under the same key without 
knowledge of the corresponding private key K^-1. 

Property 1 is essential and standard. Properties 2 and 3 are not entirely stan- 
dard. They are not implied by standard computational specifications of encryp- 
tion (e.g., [15]) but appear in formal models (e.g., [1]). Property 2 can be imple- 
mented by including some checkable redundancy in encrypted messages, without 
compromising secrecy properties. It is not essential, but we find it convenient, 
particularly for the second protocol and its enhancements. Property 3 is 
satisfied by standard cryptosystems based on the discrete-logarithm problem [5, 8], 
but it excludes implementations that tag all encryptions with key identifiers. 
Although the rigorous study of this property is relatively recent, it seems to be 
implicitly assumed in earlier work; for example, it seems to be necessary for the 
desired anonymity properties of the Skeme protocol [19]. 
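
To make properties 2 and 3 more concrete, the following sketch uses textbook ElGamal, a discrete-logarithm scheme, with a short hash tag as the checkable redundancy. It is a toy under loud assumptions: the 127-bit prime, the generator 3, the 32-bit tag, and the 64-bit payload limit are illustrative choices, the parameters are far too small for real use, and the code demonstrates only that ciphertexts for different recipients share one format with no key identifier. It does not establish which-key concealment in any rigorous sense.

```python
import hashlib
import secrets

P = 2**127 - 1   # a Mersenne prime; toy-sized group shared by all principals
G = 3            # assumed base element, chosen for illustration only


def keygen():
    """Return (public, private) with public = G^private mod P."""
    x = secrets.randbelow(P - 2) + 1
    return pow(G, x, P), x


def _tag(payload):
    """32-bit checkable redundancy derived from the payload."""
    digest = hashlib.sha256(payload.to_bytes(16, "big")).digest()
    return int.from_bytes(digest[:4], "big")


def encrypt(pub, payload):
    """Encrypt a payload below 2**64; the ciphertext carries no key identifier."""
    assert 0 <= payload < 2**64
    m = (payload << 32) | _tag(payload)
    r = secrets.randbelow(P - 2) + 1
    return pow(G, r, P), (m * pow(pub, r, P)) % P


def decrypt(priv, ct):
    """Return the payload, or None when the redundancy check fails (property 2)."""
    c1, c2 = ct
    shared = pow(c1, priv, P)
    m = (c2 * pow(shared, -1, P)) % P      # modular inverse; needs Python 3.8+
    payload, tag = m >> 32, m & 0xFFFFFFFF
    return payload if tag == _tag(payload) else None
```

Decrypting with the wrong private key yields an essentially random group element, so the tag check fails except with probability of roughly 2^-32 in this toy setting. By contrast, a scheme in which each key has its own modulus, or one that tags ciphertexts with key identifiers, would let an observer link messages to keys.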

4 Two Protocols 

This section shows two protocols that address the goals of section 2. It also 
discusses some variants of the protocols. 

The two protocols are based on standard primitives and techniques (in par- 
ticular on public-key cryptography), and resemble standard protocols. The first 
protocol uses digital signatures and requires that principals have loosely syn- 
chronized clocks. The second protocol uses only encryption and avoids the syn- 
chronization requirement, at the cost of an extra message. The second protocol 
draws attention to difficulties in achieving privacy against an active adversary. 

Undoubtedly, other protocols satisfy the goals of section 2. In particular, 
these goals seem relatively easy to satisfy when all principals confide in on-line 
authentication servers. However, the existence of ubiquitous trusted servers may 
not be a reasonable assumption. The protocols of this section do not rely on 
such trusted third parties. 

4.1 First Protocol 

In the first protocol, when a principal A wishes to talk to another principal 
B ∈ Sa, they proceed as follows: 

— A generates fresh key material K and a timestamp T, and sends out 

“hello”, {“hello”, Ka, {Ka, Kb, K, T}_{Ka^-1}}_Kb 

The key material may simply be a session key, for subsequent communication; 
it may also consist of several session keys and identifiers for those keys. The 
signature means that the principal with public key Ka (that is, A) says that 
it has generated the key material K for communicating with the principal 
with public key Kb (that is, B) near time T. The timestamp protects against 
replay attacks. 

— Upon receipt of any message that consists of “hello” and (apparently) a 
ciphertext, the recipient B decrypts the second component using its private 
key. If the decryption yields a key Ka and a signed statement of the form 
{Ka, Kb, K, T}_{Ka^-1}, then B extracts Ka and K, verifies the signature using 
Ka, and checks the timestamp T against its clock. If the plaintext is not of 
the expected form, or if A ∉ Sb, then B does nothing. 

— A and B may use K for encrypting subsequent messages. Each of these mes- 
sages may be tagged with a key identifier, derived from K but independent 
of A and B. When A or B receives a tagged message, the key identifier 
suggests the use of K for decrypting the message. 

This protocol is based on the Denning-Sacco public-key protocol and its 
corrected version [2, 11]. Noticeably, however, this protocol does not include any 
identities in cleartext. In addition, the protocol requires stronger assumptions 
on encryption, specifically that public-key encryption under Kb be which-key 
concealing. This property is needed so that A’s encrypted message does not 
reveal the identity of its (intended) recipient B. 
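
The message handling of the first protocol can be sketched in a purely symbolic model, in which encryption and signing are opaque wrappers rather than real cryptography. Everything below is an illustrative assumption: the helper names, the representation of a private key as a tagged copy of the public key, the 30-second clock tolerance, and the decision to stay silent on a bad signature or stale timestamp (the text only says B does nothing on a malformed plaintext or when A ∉ Sb). The model offers no security whatsoever; it only shows the order of checks.

```python
import secrets

MAX_SKEW = 30.0  # seconds; hypothetical tolerance for loosely synchronized clocks


def keypair(name):
    """Symbolic key pair: the private key is a tagged copy of the public key."""
    pub = "K_" + name
    return pub, ("priv", pub)


class Enc:
    """Symbolic ciphertext {body}_pub; observers are assumed unable to look inside."""

    def __init__(self, pub, body):
        self._pub, self._body = pub, body


def decrypt(ct, priv):
    """Return the plaintext if priv matches the encryption key, else None."""
    return ct._body if isinstance(ct, Enc) and priv == ("priv", ct._pub) else None


def sign(priv, body):
    return ("sig", priv[1], body)


def verify(sig, pub):
    """Return the signed body if sig was produced with pub's private key, else None."""
    if isinstance(sig, tuple) and len(sig) == 3 and sig[0] == "sig" and sig[1] == pub:
        return sig[2]
    return None


def initiate(a_pub, a_priv, b_pub, now):
    """A's message: "hello", {"hello", Ka, {Ka, Kb, K, T}_{Ka^-1}}_Kb."""
    k = secrets.token_bytes(16)
    signed = sign(a_priv, (a_pub, b_pub, k, now))
    return k, ("hello", Enc(b_pub, ("hello", a_pub, signed)))


def receive(b_pub, b_priv, s_b, msg, now):
    """B's processing: returns K on success, None (B stays silent) otherwise."""
    if msg[0] != "hello":
        return None
    plain = decrypt(msg[1], b_priv)
    if not plain or len(plain) != 3 or plain[0] != "hello":
        return None
    _, a_pub, signed = plain
    body = verify(signed, a_pub)
    if body is None or len(body) != 4:
        return None
    ka, kb, k, t = body
    fresh = abs(now - t) <= MAX_SKEW
    return k if (ka, kb) == (a_pub, b_pub) and fresh and a_pub in s_b else None
```

In this model a bystander C who receives the same broadcast obtains nothing from `receive`, and B likewise ignores the message if it is replayed long after T or if A is not in Sb.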

When A wishes to communicate with several principals B1, ..., Bn at the 
same time (for example, when A arrives at a new location), A may simply start n 
instances of the protocol in parallel, sending different key material to each of B1, 
..., Bn. Those of B1, ..., Bn who are present and willing to communicate with A 
will be able to do so using the key material. (Section 4.3 describes optimizations 
of the second protocol for this situation.) 



4.2 Second Protocol 

In the second protocol, when a principal A wishes to talk to another principal 
B ∈ Sa, they proceed as follows: 

— A generates a fresh, unpredictable nonce Na, and sends out 

“hello”, {“hello”, Na, Ka}_Kb 



(In security protocols, nonces are quantities generated for the purpose of 
being recent; they are typically used in challenge-response exchanges.) 

— Upon receipt of any message that consists of “hello” and (apparently) a 
ciphertext, the recipient B tries to decrypt the second component using its 
private key. If the decryption succeeds, then B extracts the corresponding 
nonce Na and key Ka, checks that A ∈ Sb, generates a fresh, unpredictable 
nonce Nb, and sends out 

“ack”, {“ack”, Na, Nb, Kb}_Ka 

If the decryption fails, if the plaintext is not of the expected form, or if 
A ∉ Sb, then B sends out a “decoy” message. This message should basically 
look like B’s other message. In particular, it may have the form 

“ack”, {N}_K 

where N is a fresh nonce (with padding, as needed) and only B knows K^-1, 
or it may be indistinguishable from a message of this form. 

— Upon receipt of a message that consists of “ack” and (apparently) a cipher- 
text, A tries to decrypt the second component using its private key. If the 
decryption succeeds, then A extracts the corresponding nonces Na and Nb 
and key Kb, and checks that it has recently sent Na encrypted under Kb. 
If the decryption or the checks fail, then A does nothing. 

— Subsequently, A and B may use Na and Nb as shared secrets. In particular, 
they may compute one or more session keys by concatenating and hashing 
the nonces. They may also derive key identifiers, much as in the first protocol. 

In summary, the message flow of a successful exchange is: 

A → B: “hello”, {“hello”, Na, Ka}_Kb 
B → A: “ack”, {“ack”, Na, Nb, Kb}_Ka 

Section 4.3 describes variants of this basic pattern, for example (as mentioned 
above) for the case where A wishes to communicate with n principals B1, ..., 
Bn. 
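
The last step of the protocol, computing session keys “by concatenating and hashing the nonces”, might look as follows. The choice of SHA-256, the label prefixes used to obtain several independent values, the 8-byte identifier length, and the function names are assumptions made for this sketch; the text does not fix any of them. The sketch also assumes fixed-length nonces so that the concatenation is unambiguous.

```python
import hashlib


def derive(n_a: bytes, n_b: bytes, label: bytes) -> bytes:
    """Hash a purpose label together with the concatenated nonces."""
    return hashlib.sha256(label + b"|" + n_a + n_b).digest()


def session_material(n_a: bytes, n_b: bytes) -> dict:
    """Derive two session keys and a short key identifier from Na and Nb."""
    return {
        "enc_key": derive(n_a, n_b, b"enc"),
        "mac_key": derive(n_a, n_b, b"mac"),
        # Sent in clear on later messages; depends only on the nonces,
        # not on the identities of A and B.
        "key_id": derive(n_a, n_b, b"id")[:8],
    }
```

Both principals hold Na and Nb after a successful exchange, so each can compute the same values locally without further messages. A production design would more likely use a standard key-derivation function such as HKDF.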

This protocol has some similarities with the Needham-Schroeder public-key 
protocol [24] and others [19,20]. However, like the first protocol, this one does 
not include any identities in cleartext, and again that is not quite enough for 
privacy. As in the first protocol, public-key encryption should be which-key con- 
cealing so that encrypted messages do not reveal the identities of their (intended) 
recipients. Furthermore, the delicate use of the decoy message is important: 

— B’s decoy message is unfortunately necessary in order to prevent an attack 
where a malicious principal C ∉ Sb computes and sends 

“hello”, {“hello”, Nc, Ka}_Kb 

and then deduces B’s presence and A ∈ Sb by noticing a response. In order 
to prevent this attack, the decoy message should look to C like it has the 
form “ack”, {“ack”, Nc, Nb, Kb}_Ka. 

— B’s response to A when A ∉ Sb should look as though B was someone 
else, lest A infer B’s presence. Since B sends a decoy message when its 
decryption fails, it should also send one when A ∉ Sb. For this purpose, the 
decoy message should look to A like one that some unknown principal would 
send in response to A’s message. 

The decoy message “ack”, {N}_K is intended to address both of these 
requirements. 
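
The role of the decoy can be seen in a symbolic sketch of the second protocol. As before, encryption is modelled by an opaque wrapper, not real cryptography, and all names are hypothetical. One simplification is flagged explicitly: the decoy here is encrypted under a freshly invented key whose private half is discarded, which corresponds to the “indistinguishable from a message of this form” option rather than to a key K whose inverse B retains.

```python
import secrets


def keypair(name):
    """Symbolic key pair: the private key is a tagged copy of the public key."""
    pub = "K_" + name
    return pub, ("priv", pub)


class Enc:
    """Symbolic ciphertext {body}_pub; observers are assumed unable to look inside."""

    def __init__(self, pub, body):
        self._pub, self._body = pub, body


def decrypt(ct, priv):
    """Return the plaintext if priv matches the encryption key, else None."""
    return ct._body if isinstance(ct, Enc) and priv == ("priv", ct._pub) else None


def hello(a_pub, b_pub):
    """A's first message: "hello", {"hello", Na, Ka}_Kb."""
    n_a = secrets.token_bytes(16)
    return n_a, ("hello", Enc(b_pub, ("hello", n_a, a_pub)))


def respond(b_pub, b_priv, s_b, msg):
    """B answers every hello: a real ack if able and willing, a decoy otherwise."""
    plain = decrypt(msg[1], b_priv)
    if plain and len(plain) == 3 and plain[0] == "hello" and plain[2] in s_b:
        _, n_a, a_pub = plain
        n_b = secrets.token_bytes(16)
        return n_b, ("ack", Enc(a_pub, ("ack", n_a, n_b, b_pub)))
    # Decoy: a fresh nonce under a throwaway key (private half discarded).
    decoy_pub, _ = keypair("decoy-" + secrets.token_hex(8))
    return None, ("ack", Enc(decoy_pub, (secrets.token_bytes(16),)))


def finish(a_priv, n_a, b_pub, msg):
    """A accepts only an ack that echoes its own nonce and names Kb."""
    plain = decrypt(msg[1], a_priv)
    if plain and len(plain) == 4 and plain[:2] == ("ack", n_a) and plain[3] == b_pub:
        return plain[2]
    return None
```

In all three cases (success, A ∉ Sb, and a hello meant for someone else) an observer sees a message of the same outward shape from B, so the mere presence of a reply reveals neither B’s identity nor the contents of Sb.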



4.3 Efficiency Considerations 

Both protocols can be rather inefficient in some respects. These inefficiencies are 
largely unavoidable consequences of the goals of private authentication. 

— A generates its message and sends it before having any indication that B 
is present and willing to communicate. In other situations, A might have 
first engaged in a lightweight handshake with B, sending the names A and 
B and waiting for an acknowledgment. Alternatively, both A and B might 
have broadcast their names and their interest in communicating with nearby 
principals. Here, these preliminary messages are in conflict with the privacy 
goals, even though they do not absolutely prove the presence of A and B to 
an eavesdropper. Some compromises may be possible; for example, A and 
B may publish some bits of information about their identities if those bits 
are not deemed too sensitive. In addition, in the second protocol, A may 
precompute its message. 

— Following the protocols, B may examine many messages that were encrypted 
under the public keys of other principals. This examination may be costly, 
perhaps opening the door to a denial-of-service attack against B. In other 
situations, A might have included the name B, the key Kb, or some identifier 
for Kb in clear in its message, as a hint for B. Here, again, the optimization 
is in conflict with the privacy goals, and some compromises may be possible. 

The second protocol introduces some further inefficiencies, but those can be 
addressed as follows: 

— In the second protocol, A may process many acknowledgments that were 
encrypted under the public keys of other principals. This problem can be 
solved through the use of a connection identifier: A can create a fresh
identifier I and send it to B, and B can return I in clear as a hint that A should
decrypt its message:

A → B : “hello”, I, {“hello”, N_A, K_A}_{K_B}
B → A : “ack”, I, {“ack”, N_A, N_B, K_B}_{K_A}

The identifier I should also appear in B’s decoy message. Third parties may 
deduce that the messages are linked, because I is outside the encryptions, 
but cannot relate the messages to A and B.

— Suppose that A wishes to communicate with several principals, B_1, ..., B_n.
It could initiate n instances of the protocol. However, combining the messages
from all the instances can be faster. In particular, although each of B_1, ...,
B_n should receive a different nonce, they can all share a connection identifier.
Moreover, when K_A is long, its public-key encryption may be implemented
as a public-key encryption of a shorter symmetric key K plus an encryption
of K_A using K; the key K and the latter encryption may be the same for
B_1, ..., B_n. Thus, A may send:

“hello”, I, {K_A}_K, {“hello”, H(K_A), N_{A1}, K}_{K_{B1}}, ..., {“hello”, H(K_A), N_{An}, K}_{K_{Bn}}

where H is a one-way hash function. Most importantly, the need for decoy
messages is drastically reduced. A principal that plays the role of B need 
not produce n true or decoy acknowledgments, but only one. Specifically, 
B should reply to a ciphertext encrypted under Kb, if A included one in 
its message, and send a decoy message otherwise. This last optimization 
depends on our assumption that B can recognize whether a ciphertext was 
produced by encryption under K_B.
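The connection identifier optimization can be illustrated with a small sketch. The function names and the representation of messages as (tag, identifier, ciphertext) triples are assumptions made for this example; the point is only that A attempts decryption solely on acknowledgments whose clear identifier matches one it issued.

```python
import secrets


def new_connection_id():
    """Fresh identifier I, sent in clear next to the encrypted hello."""
    return secrets.token_hex(8)


def acks_worth_decrypting(pending_ids, received):
    """Keep only acknowledgments whose clear identifier matches a pending hello.

    `received` is an iterable of (tag, identifier, ciphertext) triples.
    """
    return [msg for msg in received
            if msg[0] == "ack" and msg[1] in pending_ids]
```

When A addresses several principals with one combined message, the same identifier can be placed in `pending_ids` once and shared by all expected replies.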

With these and other improvements, both protocols are practical enough 
in certain systems, although they do not scale well. Suppose that principals 
wish to communicate with few other principals at a time, and that any one 
message reaches few principals, for instance because messages are broadcast 
within small locations; then it should be possible for principals that come into 
contact to establish private, authenticated connections (or fail to do so) within 
seconds. What is “few”? A simple calculation indicates that 10 is few, and maybe 
100 is few, but 1000 is probably not few. Typically, the limiting performance 
factor will be public-key cryptography, rather than communications: each public- 
key operation takes a few milliseconds or tens of milliseconds in software on 
modern processors (e.g., [21]). Perhaps the development of custom cryptographic 
techniques (flavors of broadcast encryption) can lead to further efficiency gains. 
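The "simple calculation" can be reproduced roughly as follows. The cost of 10 ms per public-key operation and the choice of one operation per principal are assumptions picked from the range quoted above, not measurements.

```python
def handshake_seconds(principals, ms_per_op=10.0, ops_per_principal=1):
    """Rough time spent on public-key operations for one round of messages."""
    return principals * ops_per_principal * ms_per_op / 1000.0


for n in (10, 100, 1000):
    print(n, handshake_seconds(n))
# 10 0.1
# 100 1.0
# 1000 10.0
```

Under these assumptions 10 or 100 principals stay within about a second, while 1000 principals already cost around ten seconds of computation.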

4.4 Groups 

In the problem described above, the sets of principals Sa and Sb with which
A and B wish to communicate, respectively, are essentially presented as sets of 
public keys. In variants of the problem, Sa, Sb, or both may be presented in 
other ways. The protocols can be extended to some situations where a principal 
wants to deal with others not because of their identities but because of their 
attributes or memberships in groups, such as “ACME printers” or “Italians”. 
These extensions are not all completely satisfactory. 

— Suppose that B is willing to communicate with any principal in a certain 
group, without having a full list of those principals. However, let us still 
assume that Sa is presented as a set of public keys. In this case, we can 
extend our protocols without much trouble: A can include certificates in its 
encrypted message to B, proving its membership in groups.

— Suppose that, instead, A wants to communicate with any principal in a
certain group, and Sb is presented as a set of public keys. The roles in the 
protocols may be reversed to handle this case. 

— However, the protocols do not address the case in which neither Sa nor Sb 
is presented as a set of public keys, for example when both are presented as 
groups. Introducing group keys may reduce this case to familiar ones, but 
group keys can be harder to manage and protect. 

5 Related Problems and Related Work 

The questions treated here are broadly related to traffic analysis, and how to 
prevent it. This subject is not new, of course. In particular, work on message 
untraceability has dealt with the question of hiding (unlinking) the origins and 
destinations of messages (e.g., [10, 26, 27]). It has produced techniques that allow 
a principal A to send messages to a principal B in such a way that an adversary 
may know the identities of A and B and their locations, but not that they are 
communicating with one another. Those techniques address how to route a mes- 
sage from A to B without leaking information. In the case of cellular networks, 
those techniques can be adapted to hide the locations of principals [13,28]. In 
contrast, here we envision that all messages are broadcast within a location, sim- 
plifying routing issues, and focus on hiding the identities of principals that meet 
and communicate at the location. Other interesting work on untraceability in 
mobile networks has addressed some important authentication problems under 
substantial infrastructure assumptions, for instance that each principal has a 
home domain and that an authentication server runs in each domain [4, 23, 30]. 
That work focuses on the interaction between a mobile client and an authen- 
tication server of a domain that the client visits, typically with some privacy 
guarantees for the former but not for the latter. In contrast, we do not rely on 
those infrastructure assumptions and we focus on the interaction between two 
mobile principals with potentially similar privacy requirements. 

There has been other research on various aspects of security in systems with 
mobility (e.g., [9, 32, 33] in addition to [4, 6, 13, 18, 23, 30], cited above). Some of 
that work touches on privacy issues. In particular, the work of Jakobsson and 
Wetzel points out some privacy problems in Bluetooth. The protocols of this 
paper are designed to address such problems. 

The questions treated here are also related to the delicate balance between 
privacy and authenticity in other contexts. This balance plays an important role 
in electronic cash systems (e.g., [16]). It can also appear in traditional access 
control. Specifically, suppose that A makes a request to B, and that A is a member
of a group that appears in the access control list that B consults for the request. 
In order to conceal its identity, A might use a ring signature [29] for the request, 
establishing that the request is from a member of the group without letting B 
discover that A produced the signature. However, it may not be obvious to A that 
showing its membership could help, and B may not wish to publish the access 
control list. Furthermore, A may not wish to show all its memberships to B.
Thus, there is a conflict between privacy and authenticity in the communication
between A and B. No third parties need be involved. In contrast, we do not 
guarantee the privacy of A and B with respect to each other, and focus on 
protecting them against third parties. 

Designated verifier proofs address another trade-off between confidentiality 
and authenticity [17]. They allow a principal A to construct a proof that will 
convince only a designated principal B. For instance, only B may be convinced 
of A’s identity. Designated verifier proofs differ from the protocols of this paper 
in their set-up and applications (e.g., for fair exchange). Moreover, in general, 
they may leak information about A and B to third parties, without necessarily 
convincing them. Therefore, at least in general, they need not provide a solution 
to the problem of private authentication treated in this paper. 

6 Conclusion 

Security protocols can contribute to the tension between communication and pri- 
vacy, but they can also help resolve it. In this paper, we construct two protocols 
that allow principals to authenticate with chosen interlocutors while hiding their 
identities from others. In particular, the protocols allow mobile principals to com- 
municate when they meet, without being monitored by third parties. The pro- 
tocols resemble standard ones, but interestingly they rely on some non-standard 
assumptions and messages to pursue non-standard objectives. Like virtually all
protocols, however, they are only meaningful in the context of larger systems. 
They are part of a growing suite of technical and non-technical approaches to 
privacy.

Towards an Information Theoretic Metric for Anonymity

Abstract. In this paper we look closely at the popular metric of anonymity,
the anonymity set, and point out a number of problems associated
with it. We then propose an alternative information theoretic measure 
of anonymity which takes into account the probabilities of users sending 
and receiving the messages and show how to calculate it for a message 
in a standard mix-based anonymity system. We also use our metric to 
compare a pool mix to a traditional threshold mix, which was impossible 
using anonymity sets. We also show how the maximum route length re- 
striction which exists in some fielded anonymity systems can lead to the 
attacker performing more powerful traffic analysis. Finally, we discuss 
open problems and future work on anonymity measurements. 



1 Introduction 

Remaining anonymous has been an unsolved problem ever since Captain Nemo. 
Yet in some situations we would like to provide guarantees of a person remaining 
anonymous. However, the meaning of this, both on the internet and in real 
life, is somewhat elusive. One can never remain truly anonymous, but relative 
anonymity can be achieved. For example, walking through a crowd of people does 
not allow a bystander to track your movements (though be sure that your clothes 
do not stand out too much). We would like to express anonymity properties in 
the virtual world in a similar fashion, yet this is more difficult. The users would 
like to know whether they can be identified (or rather the probability of being 
identified). Similarly, they would like to have a metric to compare different ways
of achieving anonymity: what makes you more difficult to track in London — 
walking through a crowd or riding randomly on the underground for a few hours? 

In this paper, we choose to abstract away from the application level issues 
of anonymous communication such as preventing the attacker from embedding 
URLs pointing to the attacker’s webpage in messages in the hope that the vic- 
tim’s browser opens them automatically. Instead, we focus on examining ways of 
analysing the anonymity of messages going through mix-based anonymity
systems [Cha81] in which all network communication is observable by the attacker.

In such a system, the sender, instead of passing the message directly to the
recipient, forwards it via a number of mixes. Each mix waits for n messages to
arrive before decrypting and forwarding them in a random order, thus hiding
the correspondence between incoming and outgoing messages.

R. Dingledine and P. Syverson (Eds.): PET 2002, LNCS 2482, pp. 2003.

© Springer-Verlag Berlin Heidelberg 2003

Perhaps the most intuitive way of measuring the anonymity of a message M 
in a mix system is to just count the number of messages M has been mixed 
with while passing through the system. However, as pointed out in and 

rTTTrni . this is not enough as all the other messages could, for instance, come 
from a single known sender. Indeed, the attacker may mount the so-called n − 1
attack based on this observation by sending n − 1 of their own messages to each
of the mixes on M’s path. In this case, the receiver of M ceases to be anonymous.

Another popular measure of anonymity is the notion of anonymity set. In the 
rest of this section we look at how anonymity sets have previously been defined 
in the literature and what systems they have been used in. 

1.1 Dining Cryptographers’ Networks 

The notion of anonymity set was introduced by Chaum in [Cha88] in order to
model the security of Dining Cryptographers’ (DC) networks. The size of the 
anonymity set reflects the fact that even though a participant in a Dining Cryp- 
tographers’ network may not be directly identifiable, the set of other participants 
that he or she may be confused with, can be large or small, depending on the 
attacker’s knowledge of particular keys. The anonymity set is defined as the set 
of participants who could have sent a particular message, as seen by a global 
observer who has also compromised a set of nodes. Chaum argues that its size 
is a good indicator of how good the anonymity provided by the network really 
is. In the worst case, the size of the anonymity set is 1, which means that no
anonymity is provided to the participant. In the best case, it is the size of the 
network, which means that any participant could have sent the message. 

1.2 Stop and Go Mixes 

In [KEB98] Kesdogan et al. also use sets as the measure of anonymity.
Furthermore, they define the anonymity set of users as those who had a non-zero
probability of having the role R (sender or recipient) for a particular message.
The size of the set is then used as the metric of anonymity. Furthermore, de- 
terministic anonymity is defined as the property of an algorithm which always 
yields anonymity sets of size greater than 1. 

The authors also state that it is necessary to protect users of anonymity
systems against the n − 1 attack described earlier and propose two different
ways of doing so: the Stop-and-Go mixes and a scheme for mix cascades¹.
Stop-and-Go mixes are a variety of mixes that, instead of waiting for a particular
number of messages to arrive, flush them according to some delay which is included
in the message. They protect against the n − 1 attack by discarding the messages
if they are received outside the specified time frame. Thus, the attacker cannot
delay messages, which is required to mount the n − 1 attack.

¹ An anonymity system based on mix cascades is one where all the senders send all
their messages through one particular sequence of mixes.

1.3 Standard Terminology

In an effort to standardise the terminology used in anonymity and pseudonymity 
research publications and clarify different concepts, Pfitzmann and Köhntopp
[PK00] define anonymity itself as:

“Anonymity is the state of being not identifiable within a set of subjects, 
the anonymity set.” 

In order to further refine the concept of anonymity and anonymity set and 
in an attempt to find a metric for the quality of the anonymity provided they 
continue: 

“Anonymity is the stronger, the larger the respective anonymity set is 
and the more evenly distributed the sending or receiving, respectively, 
of the subjects within that set is.” 

The concept of “even distribution” of the sending or receiving of members 
of the set identifies a new requirement for judging the quality of the anonymity 
provided by a particular system. It is not obvious anymore that the size is a 
very good indicator, since different members may be more or less likely to be 
the sender or receiver because of their respective communication patterns. 

2 Difficulties with Anonymity Set Size 

The attacks against DC networks presented in [Cha88] can only result in
partitions of the network in which all the participants are still equally likely to
have sent or received a particular message. Therefore the size of the anonymity
set is a good metric of the quality of the anonymity offered to the remaining
participants. In the Stop-and-Go system [KEB98] definition, the authors realise
that different senders may not have been equally likely to have sent a particular
message, but choose to ignore it. We note, however, that in the case they are
dealing with (mix cascades in a system where each mix verifies the identities of
all the senders), all senders have equal probability of having sent (received) the
message. In the standardisation attempt [PK00], we see that there is an attempt
to state, and take into account, this fact in the notion of anonymity, yet a formal
definition is still lacking.

We have come to the conclusion that the potentially different probabilities 
of different members of the anonymity set actually having sent or received the 
message are unwisely ignored in the literature. Yet they can give a lot of extra 
information to the attacker. 



2.1 The Pool Mix 

To further emphasise the dangers of using sets and their cardinalities to assess
and compare anonymity systems, we note that some systems have very strong
“anonymity set” properties. We take the scenario in which the anonymity set
of a message passing through a mix includes (at least) the senders of all the
messages which have ever passed through that mix.

Fig. 1. A Pool Mix

This turns out to be the case for the “pool mix” introduced by Cottrell
in [Cot94]. This mix always stores a pool of n messages (see Figure 1). When
N incoming messages have accumulated in its buffer, it picks n randomly out
of the n + N it has, and stores them, forwarding the other ones in the usual
fashion. Thus, there is always a small probability of any message which has
ever been through the mix not having left it. Therefore, the sender of every
message should be included in the anonymity set (we defer the formal derivation
of this fact until Section 5). At this point we must consider the anonymity
provided by this system. Does it really give us very strong anonymity guarantees
or is measuring anonymity using sets inappropriate in this case? Our intuition
suggests the latter², especially as the anonymity set seems to be independent of
the size of the pool, n.
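To make the "small probability" concrete, here is a minimal sketch. It assumes that the n messages kept at each flush are chosen uniformly at random from the n + N present, so a given message is kept with probability n / (n + N) per flush; the function name is ours.

```python
def prob_still_in_pool(pool_size, batch_size, rounds):
    """Probability that a given message has not left the mix after `rounds` flushes."""
    keep = pool_size / (pool_size + batch_size)
    return keep ** rounds
```

The value shrinks geometrically with the number of flushes but never reaches zero, which is why every past sender stays in the anonymity set regardless of n.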



2.2 Knowledge Vulnerability 

Yet another reason for being sceptical of the use of anonymity sets is the vulner- 
ability of this metric against an attacker’s additional knowledge. Consider the 
arrangement of mixes in Figure 2. The small squares in the diagram represent
senders, labelled with their name. The bigger boxes are mixes, with threshold of
2. Some of the receivers are labelled with their sender anonymity sets.

Notice that if the attacker somehow establishes the fact that, for instance,
A is communicating with R, he can derive the fact that S received a message
from E. Indeed, to expose the link E → S, all the attacker needs to know is that
one of A, B, C, D is communicating to R. And yet this is in no way reflected
in S’s sender anonymity set (although E’s receiver anonymity set, as expected,
contains just R and S).

² A side remark is in order here. In a practical implementation of such a mix, one
would, of course, put an upper limit on the time a message can remain on the mix
with a policy such as: “All messages should be forwarded on within 24 hours + K
mix flushes of arrival”.

Fig. 2. Vulnerability of Anonymity Sets



It is also clear that not all senders in this arrangement are equally vulnerable 
to this, as is the fact that other arrangements of mixes may be less so. Although 
we have highlighted the attack here by using mixes with threshold of 2, it is clear 
that the principle can be used in general to cut down the size of the anonymity 
set. 



3 Entropy 

We have now discussed several separate and, in our view, important issues with 
using anonymity sets and their cardinalities for measuring anonymity. We have 
also demonstrated that there is a clear need to reason about information con- 
tained in probability distributions. One could therefore borrow mathematical 
tools from Information Theory [Sha48]. The concept of entropy was first
introduced to quantify the uncertainty one has before an experiment. We now proceed
to define our anonymous communication model and the metrics that use entropy
to describe its quality. The model is very close to the one described in [KEB98].

Definition 1. Given a model of the attacker and a finite set of all users $\Psi$, let
$r \in \mathcal{R}$ be a role for the user ($\mathcal{R}$ = {sender, recipient}) with respect to a message
M. Let $\mathcal{U}$ be the attacker’s a-posteriori probability distribution of users $u \in \Psi$
having the role r with respect to M.

In the model above we do not have an anonymity set but an r anonymity
probability distribution $\mathcal{U}$. For the mathematically inclined, $\mathcal{U} : \Psi \times \mathcal{R} \to [0, 1]$
s.t. $\sum_{u \in \Psi} \mathcal{U}(u, r) = 1$. In other words, given a message M, we have a probability
distribution of its possible senders and receivers, as viewed by the attacker.
$\mathcal{U}$ may assign zero probability to some users, which means that they cannot
possibly have had the role r for the particular message M. For instance, if the
message we are considering was seen by the attacker as having arrived at Q,
then $\mathcal{U}(Q, \text{recipient}) = 1$ and $\forall S \ne Q,\ \mathcal{U}(S, \text{recipient}) = 0$.³ If all the users that
are not assigned a zero probability have an equal probability assigned to them,
as in the case of a DC network under attack, then the size of the set could be
used to describe the anonymity. The interesting case is when users are assigned
different, non-zero probabilities.

³ Alternatively, we may choose to view the sender/receiver anonymity probability
distribution for a message M as an extension of the underlying sender/receiver
anonymity set to a set of pairs of users with their associated (non-zero) probabilities
of sending or receiving it.

Definition 2. We define the effective size S of an r anonymity probability
distribution $\mathcal{U}$ to be equal to the entropy of the distribution. In other words

\[ S = -\sum_{u \in \Psi} p_u \log_2(p_u) \]

where $p_u = \mathcal{U}(u, r)$.

One could interpret this effective size as the number of bits of additional
information that the attacker needs in order to definitely identify the user u
with role r for the particular message M. It is trivial to show that if one user is
assigned a probability of 1 then the effective size is 0 bits, which means that
the attacker already has enough information to identify the user.

There are some further observations: 

— It is always the case that $0 \le S \le \log_2 |\Psi|$.

— If S = 0 the communication channel is not providing any anonymity.

— If for all possible attacker models, $S = \log_2 |\Psi|$, the communication channel
provides perfect $\mathcal{R}$ anonymity.
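The effective size is straightforward to compute once the distribution is known. The sketch below represents the distribution as a Python dictionary from users to probabilities, which is our choice of representation; users with zero probability contribute nothing to the sum.

```python
import math


def effective_size(dist):
    """Entropy, in bits, of a {user: probability} anonymity distribution."""
    if not math.isclose(sum(dist.values()), 1.0):
        raise ValueError("probabilities must sum to 1")
    # p * log2(1 / p) equals -p * log2(p) and keeps every term non-negative.
    return sum(p * math.log2(1 / p) for p in dist.values() if p > 0)
```

For example, a uniform distribution over three users gives log2(3) bits, the upper bound for a set of that size, while a distribution that puts all the mass on one user gives 0 bits.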

We now go on to show how to derive the discrete probability distribution 
required to calculate the information theoretic metric of anonymity presented 
above. 

3.1 Calculating the Anonymity Probability Distribution 

We now show how to calculate the sender anonymity probability distribution for 
a particular message passing through a mix system with the standard threshold 
mixes. We assume that we have the ability to distinguish between the different 
senders using the system. This assumption is discussed in Sectional To analyse 
a run of the system (we leave this notion informal), we have to have knowledge 
of all of the messages which have been sent during the run. (This includes mix- 
user, user-mix and mix-mix messages and is consistent with the model of the 
attacker who sees all the network communications, but has not compromised 
any mixes.) The analysis attaches a sender anonymity probability distribution 
to every message. The starting state is illustrated in Figure 3a.

We take the case of the attacker performing “pure” traffic analysis. In other
words, he does not have any a-priori knowledge about the senders and receivers
and the possible communications between them⁴. The attacker’s assumption
arising from this is that a message, having arrived at a mix, was equally likely
to have been forwarded to all of the possible “next hops”, independent of what
that next hop could be.

⁴ This is a simplification. In practice, the attacker analysing email can choose to assign
lower probabilities to, for example, potential Greek senders of an email in Russian
which arrived in Novosibirsk.

Fig. 3. a) The start of the analysis, b) Deriving the anonymity probability distribution
of messages going through a mix


For a general mix with n incoming messages with anonymity probability
distributions $L_0, \ldots, L_{n-1}$, which we view as sets of pairs, we observe that the
anonymity probability distributions of all of the messages coming out of the
mix are going to be the same. This distribution A is defined as follows:

\[ (x, p) \in A \iff \exists i.\ (x, p') \in L_i \ \text{and}\ p = \frac{1}{n} \sum_{i \,:\, (x, p_i) \in L_i} p_i \]

Thus, the anonymity probability distribution on each of the outgoing arrows
in Figure 3b is $\{(A, \tfrac{1}{3}), (B, \tfrac{1}{3}), (C, \tfrac{1}{3})\}$.
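The rule for a threshold mix can be written down directly. In this sketch each incoming distribution is a dictionary from senders to probabilities (our representation), and exact fractions are used so that the result can be compared without rounding.

```python
from collections import defaultdict
from fractions import Fraction


def mix_output(inputs):
    """Distribution attached to every message leaving a threshold mix.

    `inputs` is a list of {sender: probability} dicts, one per incoming message.
    """
    n = len(inputs)
    out = defaultdict(Fraction)
    for dist in inputs:
        for sender, p in dist.items():
            out[sender] += Fraction(p) / n  # each input contributes p / n
    return dict(out)
```

Feeding in three messages that come directly from A, B and C yields a probability of 1/3 for each sender on every outgoing message.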

In the next section we will discuss how we can calculate the effective anony- 
mity size of systems composed of other mixes. 



3.2 Composing Mix Systems 

Given some arrangement of individual mixes connected together, it is possible to 
calculate the maximum effective anonymity size of the system from the effective 
anonymity size of its components. The assumption necessary to do this is that 
the inputs to the different entry points of this system originate from distinct 
users. In practice this assumption is very difficult to satisfy, but at least we can 
get an upper bound on how well a “perfect” system would perform. 

Assume that there are l mixes, each with effective sender anonymity size
$S_i$, $0 < i \le l$. Each of these mixes sends some messages to a mix we shall call
sec. The probability that a message going into sec originated from mix i is $p_i$,
$0 < i \le l$, with $\sum_{0 < i \le l} p_i = 1$.

Using our definitions it is clear that $S_{sec} = -\sum_{0 < i \le l} p_i \log_2(p_i)$ is the effective
anonymity size of this second mix.

The effective sender anonymity size of messages going through the system
described above is $-\sum_{0 < i \le l} \sum_{0 < j \le f(i)} p_j p_i \log_2(p_j p_i)$, which simplifies to

\[ S_{total} = S_{sec} + \sum_{0 < i \le l} p_i S_i \]

where f(i) is the number of inputs that mix i takes and $p_j$, $0 < j \le f(i)$, is the
probability corresponding to input j of mix i.

Using this rule we can calculate the effective sender anonymity set size of mix 
systems or other anonymous communication channels using the effective sizes of 
their components and information about how they are interconnected. A similar 
approach can be used to calculate the effective recipient anonymity set size. 
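The composition rule can be checked numerically. The sketch below compares the composed value with the entropy computed directly from the products of probabilities, under the stated assumption that all inputs originate from distinct users; the function names and the list-based representation are ours.

```python
import math


def entropy(probs):
    """Entropy in bits of a list of probabilities."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)


def composed_size(weights, input_dists):
    """S_total = S_sec + sum_i p_i * S_i for first-layer mixes feeding mix sec.

    weights[i] is p_i; input_dists[i] lists the input probabilities of mix i.
    """
    s_sec = entropy(weights)
    return s_sec + sum(p * entropy(d) for p, d in zip(weights, input_dists))


def direct_size(weights, input_dists):
    """Entropy of the flattened distribution of products p_i * p_j."""
    return entropy([p * q for p, d in zip(weights, input_dists) for q in d])
```

For instance, with weights [0.5, 0.5] and input distributions [[0.5, 0.5], [0.25, 0.75]] both functions agree up to floating-point error.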

In the next section, we look at how knowledge about the system available to 
the attacker can be used to perform a better anonymity analysis. 

4 Route Length 

Having included probabilities in the model and demonstrated that they can give 
the attacker more information about the system than just anonymity sets, we 
note that the standard attacks aimed at reducing the size of the anonymity set 
will now have the effect of narrowing the anonymity probability distribution. 
If we consider this distribution as a set of pairs (of a sender and its respective 
non-zero probability of having sent the message), then narrowing the probability 
distribution is the process of deriving that some senders have zero probability of 
sending the message and can therefore be safely excluded from the set. 

We now look at an attack which not only has the ability to narrow the prob- 
ability distribution, but also to alter it in such a way as to reduce the entropy 
of the anonymity probability distributions without affecting the underlying ano- 
nymity set. 

As suggested in IHUSOOI . route length is important and some arrangements 
of mixes are more vulnerable to route length based attacks than others. Here, 
we demonstrate that maximum route length should be taken into account when 
calculating anonymity sets. Note that, of course, the maximum route length in 
a traditional mix-based anonymity system exists and is known to the attacker⁵.
Several mix systems have been designed to remove the maximum route length 
constraint, for instance via tunnelling in Onion Routing [STblRfl) or Hybrid 
mixes pinDi, but it exists in fielded systems such as Mixmaster |MCflflj (max- 
imum route length of 20) and so can be used by the attacker. 

It may also be possible to obtain relevant information by compromising a mix. 
Some mix systems will allow a mix to infer the number of mixes a message has 
already passed through and therefore the maximum number of messages it may 
go through before reaching the destination. Such information would strengthen 
our attack, so care needs to be taken to design mix systems (such as Mixmaster 
Esm) which do not give it away. 

We illustrate the problem by example. Consider the situation in Figure 4,
where each arrow represents a message observed by the attacker. Now let us
suppose that the maximum route length is 2, i.e. any message can pass through
no more than 2 mixes. The arrow from the bottom to the rightmost mix could
only have been the message from C as otherwise this message, coming from A
or B, would have gone through 3 mixes. From this, we infer that C was not
communicating to S, which makes S’s sender anonymity set {A, B}. Of course,
without taking maximum route length into account, this anonymity set would
have been {A, B, C}.

Fig. 4. Using maximum route length to reduce anonymity sets

Fig. 5. Using maximum route length to alter anonymity probability distribution

⁵ The reason for this is standard, as follows: All the messages in a mix-based system
have to have the same size, otherwise an attacker could trace particular messages.
Yet each message (when leaving the sender) has to include inside it all the addresses
of all the servers it will be forwarded via. Thus, there is a limit on the number of
the mixes a message can pass through, and it is known to the attacker.

We now illustrate how the same fact can alter the sender anonymity proba- 
bility distribution of a particular receiver and therefore reduce its entropy. 

Here we use the same arrangement of mixes, but look at a different receiver,
Q. The anonymity probability distribution worked out using the algorithm of the
previous section (without the route length constraint) is shown in Figure 5. If
the attacker knows that the maximum route length is 2, the arrow from mix 2 to
mix 3 has the sender probability distribution {(C, 1)}, and thus the probability
distribution at Q (or R) is {(A, 1/4), (B, 1/4), (C, 1/2)}. This reduces the entropy
from 1.5613 to 1.5. Compare this with the entropy of 1.585 for a uniform
distribution. It is also worth noting that an attack which eliminated one of the
possible senders would reduce the entropy to at most 1 bit, and an attack which
exposed a single host as the sender of a particular message to Q would reduce
it to 0 bits. Thus, our metric is capable of comparing not only the effectiveness
of systems, but also the power of different attacks. A similar idea has been
proposed in [DSCP02]. However, comparing “the effectiveness” of different attacks
using this method in general is beyond the scope of this paper and is the subject of
future work.
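
The entropy figures quoted above can be checked numerically. The following is a minimal sketch in Python and is not part of the original analysis. The distribution before the route length constraint is not legible in this copy, so `before` is an assumption: (3/8, 3/8, 1/4) is one distribution that reproduces the quoted 1.5613 bits. The distribution `after` is the one consistent with the quoted 1.5 bits. All names are illustrative.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Assumed distribution at Q before the route length constraint is applied.
before = [3 / 8, 3 / 8, 1 / 4]
# Distribution at Q once the attacker exploits the maximum route length of 2.
after = [1 / 4, 1 / 4, 1 / 2]
# Reference point: three equally likely senders.
uniform = [1 / 3, 1 / 3, 1 / 3]

h_before, h_after, h_uniform = entropy(before), entropy(after), entropy(uniform)
for name, h in [("before", h_before), ("after", h_after), ("uniform", h_uniform)]:
    print(f"{name}: {h:.4f} bits")
```

The drop from `h_before` to `h_after` is the information the attacker gains from knowing the route length limit.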

5 Analysis of the Pool Mix 

Recall from Section 2.1 that a pool mix stores n messages and receives N messages
every round. It then puts together the stored and received messages and
outputs N of them (chosen randomly). The remaining n messages are stored for
the next round. On round zero the mix stores n “dummy messages” that look
to outsiders as if they were real, but were created by the mix itself.

First, we calculate the sender anonymity set and the sender anonymity set
size. Denote the anonymity sets associated with the N messages arriving at
round i by $K_{i,1}, \ldots, K_{i,N}$ and let $K_i = K_{i,1} \cup \ldots \cup K_{i,N}$. Now the sender anonymity
set of the outgoing messages after round k will be the union of the anonymity
sets of the messages stored in the mix (all of which are the same and are equal to
the anonymity sets of all the messages which left the mix at the previous round)
and the messages which arrived from the network.

\[ A_0 = \{\mathit{mix}\} , \qquad A_i = A_{i-1} \cup K_i \]

Now, assume that all of the messages arrive at the pool mix directly from
senders, which are all different from each other. Formally,
$\forall i, j, k, l.\ (i \neq k \lor j \neq l) \Rightarrow K_{i,j} \neq K_{k,l}$. This implies that the size of the set after round k is

\[ |A_k| = N \times k + 1 \]

and for $k \to \infty$

\[ \lim_{k \to \infty} |A_k| = \infty \]

It is clear that the size of $A_k$ does not provide us with a useful insight on how
well this mix performs. In particular, it does not capture the fact that senders
of past rounds have smaller probabilities of being the senders of messages that
come out of the mix at the last round. Finally, this metric does not allow us to
compare the pool mix with other mixing architectures, including conventional
threshold mixes.

We therefore compute the effective size, based on the entropy, of the sender
anonymity set.

The probability that a message which comes out of the mix at round k was
introduced by a sender in the mix at round x, with 0 < x ≤ k, is

\[ p_{\mathrm{round}\,x} = \frac{N}{N+n} \left( \frac{n}{N+n} \right)^{k-x} , \qquad p_{\mathrm{round}\,0} = \left( \frac{n}{N+n} \right)^{k} \]

Definition 3. Now, assume that each message was coming directly from a
sender and all senders only send one message. Note that after round 0, the
only sender involved is the mix itself. The effective size of the sender anonymity
set for round k is

\[ E_k = - \sum_{x=1}^{k} p_{\mathrm{round}\,x} \log \frac{p_{\mathrm{round}\,x}}{N} - p_{\mathrm{round}\,0} \log p_{\mathrm{round}\,0} \]

After a large number of rounds ($k \to \infty$) the above expression of the effective
size converges towards

\[ \lim_{k \to \infty} E_k = \left( 1 + \frac{n}{N} \right) \log (N + n) - \frac{n}{N} \log n \]

The effective size of the set provides us with useful information about how 
the mix is performing. As one would expect if there is no pool then the effective 
sender anonymity set is the same as for a threshold mix architecture with N 
inputs. 

Example 1. When there is no pool (n = 0) the effective anonymity set size is

\[ \lim_{k \to \infty} E_k = \log N \]

Example 2. When only one message is fed back to the mix (n = 1)

\[ \lim_{k \to \infty} E_k = \left( 1 + \frac{1}{N} \right) \log (N + 1) \]

So a mix of this type that takes N = 100 inputs will have an effective size
of $\lim_{k \to \infty} E_k = 6.725$. This is equivalent to a threshold mix that takes ≈ 106
inputs.

Example 3. A pool mix with N = 100 inputs out of which n = 10 are fed back
will have an effective size of $\lim_{k \to \infty} E_k = 7.129$. That is equivalent to a threshold
mix with $N = 2^{7.129} \approx 140$ inputs.
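
The three examples can be reproduced with a short calculation. The sketch below is an illustration in Python under stated assumptions, not code from the paper. It assumes base-2 logarithms, that every output is equally likely to be any of the N + n messages held by the mix, and that the N senders of each round are distinct. The function names are hypothetical. Evaluated this way, Example 3 comes out at about 7.127, within rounding distance of the quoted 7.129.

```python
from math import log2

def pool_mix_entropy(N, n, k):
    """Effective sender anonymity set size (in bits) after round k."""
    r = n / (N + n)                      # chance a message stays in the pool one more round
    probs = [r ** k]                     # the mix itself (dummy messages of round 0)
    for x in range(1, k + 1):
        per_sender = r ** (k - x) / (N + n)
        probs.extend([per_sender] * N)   # N distinct senders arrived at round x
    return -sum(p * log2(p) for p in probs if p > 0)

def pool_mix_limit(N, n):
    """Closed-form value of the effective size as k grows without bound."""
    tail = (n / N) * log2(n) if n > 0 else 0.0
    return (1 + n / N) * log2(N + n) - tail

for N, n in [(100, 0), (100, 1), (100, 10)]:
    e = pool_mix_limit(N, n)
    print(f"N={N}, n={n}: limit={e:.3f} bits, threshold-mix equivalent ~ {2 ** e:.0f} inputs")
```

Comparing `pool_mix_entropy(N, n, k)` for growing k with `pool_mix_limit(N, n)` shows how quickly the effective size settles, which depends on the ratio n/(N + n).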

The additional anonymity that the pool mix provides is not “for free”, since
the average latency of the messages increases from 1 round to $1 + \frac{n}{N}$ rounds, with
a variance of $\frac{n (N + n)}{N^2}$.
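
The latency cost can be made concrete with a small numerical check. This is a sketch under an explicit assumption: each message leaves the mix in any given round with probability N/(N + n), independently of how long it has waited, so its latency follows a geometric distribution. The function name and the truncation at `max_rounds` are illustrative choices.

```python
def latency_stats(N, n, max_rounds=10_000):
    """Mean and variance (in rounds) of the time a message spends in a pool mix."""
    p_leave = N / (N + n)        # chance of being output in a given round
    still_inside = 1.0           # probability the message has not left before round t
    mean = 0.0
    second_moment = 0.0
    for t in range(1, max_rounds + 1):
        prob_t = still_inside * p_leave
        mean += t * prob_t
        second_moment += t * t * prob_t
        still_inside *= 1 - p_leave
    return mean, second_moment - mean ** 2

mean, variance = latency_stats(100, 10)
print(f"mean latency = {mean:.3f} rounds, variance = {variance:.3f}")
```

For N = 100 and n = 10 the summed series agrees with the closed forms 1 + n/N and n(N + n)/N^2.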

6 Discussion 



Let us now examine the scenarios in which our analysis may be useful and
demonstrate that one would not be able to use other well-known attacks to
compromise anonymity.

The new entropy measure of anonymity is useful in analysing mix networks,
rather than cascades or DC nets where there is no possibility of members of
anonymity sets having different probabilities of taking on particular roles. The
route length techniques are applicable in mix network systems which have a
maximum route length constraint, such as Mixmaster [MC00].

It is worth mentioning that a similar information theoretic metric was
independently proposed in [DSCP02] and used to compare different anonymity
systems. Here we concentrate on using it for analysing mix systems and show
how it can be used to express new attacks.

7 Conclusion 

We have demonstrated serious problems with using the notion of anonymity set 
for measuring anonymity of mix-based systems. In particular, we exhibited the 
pool mix as an illustration of the fact that we cannot always use anonymity sets 
to compare the effectiveness of different mix flushing algorithms. 

We have also proposed an information-theoretic metric based on the idea of 
anonymity probability distributions. We showed how to calculate them and used 
the metric to compare the pool mix to more traditional mixes. 

We must note, however, that our new metric does not really deal with the
knowledge vulnerability problem discussed in Section 2.2. We feel that additional
structure to enrich the notion of anonymity sets and enable better analysis of
knowledge-based vulnerabilities is needed. However, having introduced probabilities
into the model, we want to go on and develop a framework capable of
answering questions like “What happens to the anonymity probability distribution
of receiver S when the attacker knows that A is communicating to P with
one given probability or to R with another?”. This is the subject of future work.

We then showed that care must be taken when calculating anonymity probability
distributions, as the same attacks as used against the anonymity set metric
would also apply here. In particular, we demonstrated that if a maximum route
length exists in a mix system, it is known to the attacker and can be used to
extract additional information and gain knowledge which was impossible to express
using anonymity sets.

We feel that more sophisticated probabilistic metrics of anonymity should be
developed. Perhaps, if combined with knowledge of the communication
protocols executed by the sender and recipient, they can yield powerful attacks
against mix-based systems. Moreover, we feel that in a subject like anonymity,
formal reasoning is essential if strong guarantees are to be provided. Yet another
direction is relating the above to unlinkability and plausible deniability. All these
are subjects of future work.

Abstract. This paper introduces an information theoretic model that
makes it possible to quantify the degree of anonymity provided by schemes for
anonymous connections. It considers attackers that obtain probabilis- 
tic information about users. The degree is based on the probabilities an 
attacker, after observing the system, assigns to the different users of the 
system as being the originators of a message. As a proof of concept, the 
model is applied to some existing systems. The model is shown to be 
very useful for evaluating the level of privacy a system provides under 
various attack scenarios, for measuring the amount of information an at- 
tacker gets with a particular attack and for comparing different systems 
amongst each other. 



1 Introduction 

In today’s expanding on-line world, there is an increasing concern about the 
protection of anonymity and privacy in electronic services. In the past, many 
technical solutions have been proposed that hide a user’s identity in various ap- 
plications and services. Anonymity is an important issue in electronic payments, 
electronic voting, electronic auctions, but also for email and web browsing. 

A distinction can be made between connection anonymity and data anonymi- 
ty. Data anonymity is about filtering any identifying information out of the data 
that is exchanged in a particular application. Connection anonymity is about 
hiding the identities of source and destination during the actual data transfer. 
The model presented in this paper focuses on the level of connection anonymity 
a system can provide, and does not indicate any level of data anonymity. 

Information theory has proven to be a useful tool to measure the amount of
information (for an introduction, see Cover and Thomas). We try to measure
the information obtained by the attacker. In this paper, a model is proposed,
based on Shannon’s definition of entropy, that makes it possible to quantify the
degree of anonymity of an electronic system. This degree depends on the power
of the attacker. The model is shown to be very useful to evaluate the anonymity
a system provides under different circumstances, to compare different systems,
and to understand how a system can be improved.

1.1 Related Work

To our knowledge, there have been several attempts to quantify the degree of 
anonymity of a user provided by an anonymous connection system. 

Reiter and Rubin define the degree of anonymity as 1 − p, where p is the
probability assigned to a particular user by the attacker. We believe that this 
degree is useful to get an idea of the anonymity provided by the system to the user 
who is in the worst case, but it does not give information on how distinguishable 
the user is within the anonymity set. For a system with a large number of possible 
senders the user who is in the worst case may have an assigned probability that 
is less than 1/2 but still be distinguishable by the attacker because the rest of 
the users have very low associated probabilities. 
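
A small numerical case illustrates this point. The sketch below uses hypothetical numbers that do not come from any of the cited papers: 100 senders, one of whom is assigned probability 0.4 while the remaining probability is spread evenly over the other 99. The entropy-based degree computed here is the normalized measure proposed in this paper, and the helper name is illustrative.

```python
from math import log2

def normalized_entropy(probs):
    """Entropy of the distribution divided by the maximum entropy log2(N)."""
    h = -sum(p * log2(p) for p in probs if p > 0)
    return h / log2(len(probs))

# Hypothetical attacker output: one suspect at 0.4, 99 others sharing 0.6.
probs = [0.4] + [0.6 / 99] * 99

worst_case_degree = 1 - max(probs)      # the 1 - p measure for the most exposed user
entropy_degree = normalized_entropy(probs)

print(f"1 - p for the most exposed user: {worst_case_degree:.2f}")
print(f"entropy-based degree:            {entropy_degree:.3f}")
```

The most exposed user is still below probability 1/2, yet is about 66 times more likely than any other user, and the entropy-based degree reflects that imbalance across the whole set.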

Berthold et al. define the degree of anonymity as $A = \log_2(N)$, where N
is the number of users of the system. This degree only depends on the number of 
users of the system, and does not take into account the information the attacker 
may obtain by observing the system. Therefore, it is not useful to measure the 
robustness of the system towards attacks. The degree we propose in this paper 
measures the information the attacker gets, taking into account the whole set of 
users and the probabilistic information the attacker obtains about them. 

Wright et al. analyze the degradation of anonymous protocols. They
assume that there is a recurring connection between the sender of a message and
the receiver.

An anonymity measurement model similar to the one proposed in this paper
has been independently proposed by Serjantov and Danezis. The main
difference between the two models is that their system does not normalize the
degree in order to get a value relative to the anonymity level of the ideal system
for the same number of users.



1.2 Outline of the Paper 

This paper is organized as follows: Section 2 describes the system and attack 
model; the actual measurement model is then proposed in Section 3. As a proof 
of concept, this model is applied to some existing systems in Section 4. Finally, 
our conclusions and some open problems are presented. 



2 System Model 

In this paper we focus on systems that provide anonymity through mixes. The
system model we consider thus consists of the following entities:

Senders. These are users who send (or have the ability to send) messages to
recipients. These messages can be emails, queries to a database, requests of web
pages, or any other stream of data. The senders can be grouped into the set
of senders, that is also called the anonymity set. These are the entities of the
system whose anonymity we want to protect.

During the attack, we consider the number of senders constant, and senders
behaving as independent, identical Poisson processes. This is a standard assumption
for modeling the behavior of users making phone calls. This means that
all users send, on average, the same number of messages, and the interval of time
between one message and the next one follows an exponential distribution.

Recipients. These are the entities that receive the messages from the senders. 
Recipients can be active (if they send back answers to the senders) or passive 
(if they do not react to the received message). Depending on the system there 
is a large variety of recipients. Some examples are web servers, databases, email 
accounts or bulletin boards where users can post their messages. The attacker 
may use the reply messages to gain information. 

Mixes. These are the nodes that are typically present in solutions for anony- 
mous connections. They take messages as input, and output them so that the 
correlation with the corresponding input messages is hidden. There are many 
different ways to implement a mix; if more than a single mix is used (which is 
usually done in order to achieve better security), there are several methods to 
route the message through a chain of mixes; a summary can be found in m 
In some of the systems, e.g., Crowds, the nodes do not have the mixing properties
described by Chaum. In these cases the actual properties of the
intermediate nodes will be mentioned. 

Note that in some systems the intersection between the different sets might 
be non-empty (e.g., a sender could be at the same time a recipient or a mix). 

Examples of systems that provide anonymous connections are Crowds (Reiter and
Rubin) and Onion Routing (Reed, Syverson and Goldschlag). The proposed
measurement model is shown to be suitable for these systems. It is, however,
generally applicable to any kind of system.

2.1 Attack Model 

The degree of anonymity depends on the probabilities that the users have sent a 
particular message; these probabilities are assigned by the attacker. The degree 
is therefore measured with respect to a particular attack: the results obtained for 
a system are no longer valid if the attack model changes. Concrete assumptions 
about the attacker have to be clearly specified when measuring the degree of 
anonymity. 

We briefly describe the attacker properties we consider: 

— Internal-External: An internal attacker controls one or several entities that 
are part of the system (e.g., the attacker can prevent the entity from sending 
messages, or he may have access to the internal information of the entity); 
an external attacker can only compromise communication channels (e.g., he 
can eavesdrop or tamper with messages). 

— Passive-Active: A passive attacker only listens to the communication or reads
internal information; an active attacker is able to add, remove and modify
messages or adapt internal information.

— Local-Global: A global attacker has access to the whole communication system,
while a local attacker can only control part of the resources.

Different combinations of the previous properties are possible, for instance a 
global passive external attacker is able to listen to all the channels, while a local 
internal active attacker can control, for example, a particular mix, but is unable 
to get any other information. 

In our model, an attacker will carry out a probabilistic attack. It has been
pointed out by Raymond in [7] that these attacks have not been thoroughly
addressed so far. With such an attack, the adversary obtains probabilistic
information of the form “with probability p, A is the sender of the message”.



3 Proposed Measurement Model 

First of all, we should give a precise definition of anonymity. In this paper we
adopt the definition given by Pfitzmann and Köhntopp: anonymity is the
state of being not identifiable within a set of subjects, the anonymity set. A sender
is identifiable when we get information that can be linked to him, e.g., the IP
address of the machine the sender is using.

In this paper we only consider sender anonymity. This means that for a
particular message the attacker wants to find out which subject in the anonymity
set is the originator of the message. The anonymity set in this case is defined as
the set of honest^1 users who might send a message. It is clear that the minimum
size of the anonymity set is 2 (if there is only one user in the anonymity set it is
not possible to protect his identity).

Our definition for the degree of anonymity is based on probabilities: after 
observing the system, an attacker will assign to each user a probability of being 
the sender. 

3.1 Degree of Anonymity Provided by the System 

According to the previous definitions, in a system with N users, the maxi- 
mum degree of anonymity is achieved when an attacker sees all subjects in 
the anonymity set as equally probable of being the originator of a message. 
Therefore, in our model the degree of anonymity depends on the distribution of 
probabilities and not on the size of the anonymity set, in contrast with previous 
work. This way, we are able to measure the quality of the system with
respect to the anonymity it provides, independently from the number of users 
who are actually using it. Nevertheless, note that the size of the anonymity set 
is used to calculate the distribution of probabilities, given that the sum of all 
probabilities must be 1. 

The proposed model compares the information obtained by the attacker after
observing the system against the optimal situation, in which all honest users
seem to be equally probable as being the originator of the message, that is, in
a system with N users, the situation where the attacker sees all users as being
the originator with probability 1/N.

^1 Users controlled by the attacker are not considered as part of the anonymity set,
even if they are not aware of this control.

After observing the system for a while, an attacker may assign some probabil- 
ities to each sender as being the originator of a message, based on the information 
the system is leaking, by means of traffic analysis, timing attacks, message length 
attacks or more sophisticated attacks. 

For a given distribution of probabilities, the concept of entropy in information 
theory provides a measure of the information contained in that distribution.
We use entropy as a tool to calculate the degree of anonymity achieved by the 
users of a system towards a particular attacker. The entropy of the system after 
the attack is compared against the maximum entropy (for the same number of 
users). This way we get an idea of how much information the attacker has gained, 
or, in other words, we compare how distinguishable the sender is within the set 
of possible senders after the attack. 

Let X be the discrete random variable with probability mass function
$p_i = \Pr(X = i)$, where i represents each possible value that X may take. In this case,
each i corresponds to an element of the anonymity set (a sender). We denote
by H(X) the entropy of the system after the attack has taken place. For each
sender belonging to the senders set of size N, the attacker assigns a probability
$p_i$. H(X) can be calculated as:

\[ H(X) = - \sum_{i=1}^{N} p_i \log_2 (p_i) . \]

Let $H_M$ be the maximum entropy of the system we want to measure, for the
actual size of the anonymity set:

\[ H_M = \log_2 (N) , \]

where N is the number of honest senders (size of the anonymity set).

The information the attacker has learned with the attack can be expressed
as $H_M - H(X)$. We divide by $H_M$ to normalize the value. We then define the
degree of anonymity provided by the system as:

\[ d = 1 - \frac{H_M - H(X)}{H_M} = \frac{H(X)}{H_M} . \]

For the particular case of one user we assume d to be zero.
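
The definition translates directly into a few lines of code. The following is a minimal sketch in Python. The function name `degree_of_anonymity` is illustrative, and the input is assumed to be the list of probabilities the attacker assigns to the honest senders, summing to 1.

```python
from math import log2

def degree_of_anonymity(probs):
    """Degree d = H(X) / H_M for the attacker's probability assignment."""
    n_senders = len(probs)
    if n_senders <= 1:
        return 0.0                      # convention for a single user: d = 0
    h_x = -sum(p * log2(p) for p in probs if p > 0)
    h_max = log2(n_senders)             # entropy of the equiprobable distribution
    return h_x / h_max

print(degree_of_anonymity([0.25] * 4))            # equiprobable senders
print(degree_of_anonymity([1.0, 0.0, 0.0]))       # one sender fully identified
print(degree_of_anonymity([0.7, 0.1, 0.1, 0.1]))  # one sender stands out
```

Terms with probability zero are skipped, following the usual convention that they contribute nothing to the entropy.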

This degree of anonymity provided by the system quantifies the amount of
information the system is leaking. If in a particular system a user or a small
group of users are shown as originators with a high probability with respect to
the others, this system is not providing a high degree of anonymity^2.

It follows immediately that 0 ≤ d ≤ 1:

— d = 0 when a user appears as being the originator of a message with
probability 1.

— d = 1 when all users appear as being the originator with the same probability
($p_i = 1/N$).

^2 On the other hand, note that any system with equiprobable distribution will provide
a degree of anonymity of one, therefore a system with two senders will have d = 1 if
both of them are assigned probability 1/2. This is because the definition of anonymity
we are using is independent of the number of senders.

4 Measuring the Degree of Anonymity Provided 
by Some Systems 

In this section we apply our proposed measurement model in order to analyze 
the degree of anonymity provided by some existing systems, in particular Crowds 
and Onion Routing. 



4.1 A Simple Example: Mix Based Email 

As a first example, let us consider the system shown in Fig. 1. Here we have a
system that provides anonymous email with 10 potential senders, a mix network 
and a recipient. The attacker wants to find out which of the senders sent an email 
to this particular recipient. By means of timing attacks and traffic analysis, the 
attacker assigns a certain probability to each user as being the sender. The aim 
of this example is to give an idea on the values of the degree of anonymity for 
different distributions of probabilities. 







Fig. 1. A simple example of a mix based email system 



Active Attack. We first consider an active internal attacker who is able to control 
eight of the senders (that means that these eight users have to be excluded from 
the anonymity set). He is also able to perform traffic analysis in the whole mix 
network and assign probabilities to the two remaining senders. Let p be the 
probability assigned to user 1 and 1 — p the probability assigned to user 2. 

The distribution of probabilities is:

\[ p_1 = p , \qquad p_2 = 1 - p , \]

and the maximum entropy for two honest users is:

\[ H_M = \log_2 (2) = 1 . \]

Fig. 2. Degree of anonymity for a simple example, panels (a) and (b)

In Fig. 2a we show the variation of the degree of anonymity with respect to
p. As we could expect from the definitions, we see that d reaches the maximum 
value {d = 1) when both users are equiprobable {p = 1/2). Indeed, in this case 
the attacker has not gained any information about which of the two active users 
is the real sender of the message by analyzing the traffic in the mix network. 
The minimum level (d = 0) is reached when the attacker can assign probability 
one to one of the users (p = 0 or p = 1). 

This simple example can be useful to get an idea on the minimum degree of 
anonymity that is still adequate. Roughly, we suggest that the system should 
provide a degree d > 0.8. This corresponds to p = 0.25 for one user and p = 0.75 
for the other. In the following examples, we will again look at the probability 
distributions that correspond to this value of the degree, in order to compare the 
different systems. Nevertheless, the minimum acceptable degree for a particular 
system may depend on the anonymity requirements for that system, and we 
believe that such a minimum cannot be suggested before intensively testing the 
model. 
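
For two honest users the degree reduces to the binary entropy of p, so the correspondence between d and p can be checked directly. The sketch below is illustrative only. The bisection helper `smallest_p_reaching` is a hypothetical name, and it relies on d increasing with p on the interval [0, 1/2].

```python
from math import log2

def degree_two_users(p):
    """Degree d for two honest users with probabilities p and 1 - p (H_M = 1 bit)."""
    return -sum(q * log2(q) for q in (p, 1 - p) if q > 0)

def smallest_p_reaching(target, tol=1e-9):
    """Smallest p in [0, 1/2] whose degree is at least `target`, found by bisection."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if degree_two_users(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi

p_star = smallest_p_reaching(0.8)
print(f"d(0.25) = {degree_two_users(0.25):.4f}")
print(f"d = 0.8 is first reached near p = {p_star:.3f}")
```

The value at p = 0.25 is slightly above 0.8, which is why the pair (0.25, 0.75) serves as a rough reference point rather than an exact solution.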

Passive Attack. We now consider a passive global external attacker who is able
to analyze the traffic in the whole system, but who does not control any of the
entities (the anonymity set is, therefore, composed of 10 users). The maximum
entropy for this system is:

\[ H_M = \log_2 (10) . \]

The attacker comes to the following distribution:

\[ p_i = \frac{p}{3} , \quad 1 \le i \le 3 ; \qquad p_i = \frac{1 - p}{7} , \quad 4 \le i \le 10 . \]

In this case we have two groups of users, one with three users and the other 
one with seven. Users belonging to the same group are seen by the attacker as 
having the same probability. 

In Fig. 2b we can see the variation of d with the parameter p. The maximum
degree d = 1 is achieved for the equiprobable distribution (p = 0.3). In this case
d does not drop to zero because in the worst case, the attacker sees three users
as possible senders with probability $p_i = 1/3$, and therefore he cannot identify a
single user as the sender of the message. The reference value of d = 0.8 is reached
when three of the users are assigned probability $p_i = 0.25$, and the remaining
seven users are assigned probability $p_i = 0.036$.
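
The behaviour of d in this two-group example can be reproduced with a short sketch. It assumes that p is the total probability shared equally by the group of three users and that 1 − p is shared equally by the group of seven, which is the reading consistent with the values quoted in the text. The function name is illustrative.

```python
from math import log2

def degree_passive(p, group_a=3, group_b=7):
    """Degree d when `group_a` users share p and `group_b` users share 1 - p."""
    probs = [p / group_a] * group_a + [(1 - p) / group_b] * group_b
    h_x = -sum(q * log2(q) for q in probs if q > 0)
    return h_x / log2(group_a + group_b)

for p in (0.3, 0.75, 1.0):
    print(f"p = {p:.2f}: d = {degree_passive(p):.3f}")
```

At p = 1 the degree stays well above zero, since three equally likely senders remain.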

4.2 Crowds 

Overview of the System. Crowds (Reiter and Rubin) is designed to provide anonymity to users
who want to access web pages. To achieve this goal, the designers introduce the 
notion of “blending into a crowd” : users are grouped into a set, and they forward 
requests within this set before the request is sent to the web server. The web 
server cannot know from which member the request originated, since it gets the 
request from a random member of the crowd, that is forwarding the message 
on behalf of the real originator. The users (members of the crowd) are called 
jondos. 

The system works as follows: when a jondo wants to request a web page it
sends the request to a second (randomly chosen) jondo. This jondo will, with
probability $p_f$, forward the request to a third jondo (again, randomly chosen),
and will, with probability $1 - p_f$, submit it to the server. Each jondo in the path
(except for the first one) chooses to forward or submit the request independently
from the decisions of the predecessors in the path.
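
The forwarding rule can be illustrated with a small simulation. This is a sketch under stated assumptions, not the Crowds implementation: jondos are represented by integers, the next jondo is drawn uniformly from the whole crowd (possibly the current jondo itself), and the names `crowds_path` and `mean_hops` are hypothetical.

```python
import random

def crowds_path(initiator, crowd, p_f, rng):
    """Return the list of jondos a request visits before reaching the server."""
    # The initiator always hands the request to a randomly chosen jondo.
    path = [initiator, rng.choice(crowd)]
    # Every later jondo forwards with probability p_f, otherwise it submits.
    while rng.random() < p_f:
        path.append(rng.choice(crowd))
    return path

rng = random.Random(0)                 # fixed seed so the run is repeatable
crowd = list(range(7))                 # a crowd of 7 jondos
paths = [crowds_path(0, crowd, 0.75, rng) for _ in range(20_000)]
mean_hops = sum(len(p) - 1 for p in paths) / len(paths)
print(f"average number of jondos after the initiator: {mean_hops:.2f}")
```

Under these assumptions the number of jondos after the initiator is geometrically distributed with mean 1/(1 − p_f), so larger values of p_f give longer paths.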

Communication between jondos is encrypted using symmetric techniques, 
and the final request to the server is sent in clear text. Every jondo can observe 
the contents of the message (and thus the address of the target server), but it 
cannot know whether the predecessor is the originator of the message or whether 
he is just forwarding a message received by another member. 

Note that for this system the mixes are the jondos, and they do not have 
some of the expected characteristics. In particular, they do not make any effort 
to hide the correlation between incoming and outgoing messages. 

Attacker. In this paper we calculate the degree of anonymity provided by Crowds 
with respect to collaborating crowd members, that is, a set of corrupted jondos 
that collaborate in order to disclose the identity of the jondo that originated the 
request. The assumptions made on the attacker are: 

— Internal: The attacker controls some of the entities that are part of the 
system. 

— Passive: The corrupted jondos can listen to communication. Although they 
have the ability to add or delete messages, they will not gain extra informa- 
tion about the identity of the originator by doing so. 

— Local: We assume that the attacker controls a limited set of jondos, and he 
cannot perform any traffic analysis on the rest of the system. 

Degree of Anonymity. Figure 3 shows an example of a Crowds system. In this
example the jondos 1 and 2 are controlled by the attacker, i.e., they are
collaborating crowd members. A non-collaborating jondo creates a path that includes
at least one corrupted jondo^3. The attacker wants to know which of the
non-collaborating jondos is the real originator of the message.

Fig. 3. Example of a Crowds system with 7 jondos

Generally, let N be the number of members of the crowd, C the number of
collaborators, $p_f$ the probability of forwarding and $p_i$ the probability assigned
by the attacker to the jondo i of having sent the message. The jondos under the
control of the attacker can be excluded from the anonymity set. The maximum
entropy $H_M$, taking into account that the size of the anonymity set is N − C, is
equal to:

\[ H_M = \log_2 (N - C) . \]

From the analysis of Crowds by Reiter and Rubin we know that, under this attack
model, the probability assigned to the predecessor of the first collaborating jondo
in the path (let this jondo be number C+1) equals:

\[ p_{C+1} = \frac{N - p_f (N - C - 1)}{N} = 1 - p_f \frac{N - C - 1}{N} . \]



The probabilities assigned to the collaborating jondos remain zero, and assuming
that the attacker does not have any extra information about the rest of
non-collaborators, the probabilities assigned to those members are:

\[ p_i = \frac{1 - p_{C+1}}{N - C - 1} = \frac{p_f}{N} , \qquad C + 2 \le i \le N . \]



Therefore, the entropy of the system after the attack will be:

\[ H(X) = \frac{N - p_f (N - C - 1)}{N} \log_2 \left[ \frac{N}{N - p_f (N - C - 1)} \right] + p_f \frac{N - C - 1}{N} \log_2 \left[ \frac{N}{p_f} \right] . \]



The degree of anonymity provided by this system is a function of N, C and $p_f$.
In order to show the variation of d with respect to these three parameters we
chose $p_f = 0.5$ and $p_f = 0.75$, and N = 5 (Fig. 4a), N = 20 (Fig. 4b) and
N = 100 (Fig. 4c). The degree d is represented in each figure as a function of the
number of collaborating jondos C. The minimum value of C is 1 (if C = 0 there
is no attacker), and the maximum value of C is N − 1 (if C = N there is no user
to attack). For the case C = N − 1 we obtain d = 0 because the collaborating
jondos know that the real sender is the remaining non-collaborating jondo. We
can deduce from the figures that d decreases with the number of collaborating
jondos and increases with $p_f$. The variation of d is very similar for systems
with different numbers of users. Regarding the tolerated number of collaborating
jondos to obtain d ≥ 0.8, we observe that for $p_f = 0.5$ the system does not
tolerate any corrupted jondo; for $p_f = 0.75$ the system tolerates: for N = 5
users, C ≤ 1, for N = 20 users, C ≤ 4, and for N = 100 users, C ≤ 11.

^3 If the path does not go through a collaborating jondo the attacker cannot get any
information.

Fig. 4. Degree of anonymity for Crowds, panels (a), (b) and (c)
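
The curves of Fig. 4 can be recomputed from the probabilities given above. The following Python sketch is an illustration with hypothetical function names. It assumes the predecessor of the first collaborator is assigned (N − p_f (N − C − 1))/N and every other honest jondo p_f/N. With a strict threshold of 0.8 it reproduces the tolerances for N = 5 and N = 100. For N = 20 the degree at C = 4 evaluates to about 0.797, marginally under the threshold, so the strict computation returns 3 where the text, presumably reading from the plot, quotes 4.

```python
from math import log2

def crowds_degree(N, C, p_f):
    """Degree of anonymity of Crowds with N members, C collaborators, forwarding prob p_f."""
    honest = N - C
    if honest <= 1:
        return 0.0                                   # a single honest jondo is fully exposed
    p_pred = (N - p_f * (N - C - 1)) / N             # predecessor of the first collaborator
    p_other = p_f / N                                # each remaining honest jondo
    probs = [p_pred] + [p_other] * (honest - 1)
    h_x = -sum(p * log2(p) for p in probs if p > 0)
    return h_x / log2(honest)

def max_tolerated(N, p_f, threshold=0.8):
    """Largest C such that the degree stays at or above `threshold` for all 1..C."""
    best = 0
    for C in range(1, N):
        if crowds_degree(N, C, p_f) < threshold:
            break
        best = C
    return best

for N in (5, 20, 100):
    print(N, max_tolerated(N, 0.5), max_tolerated(N, 0.75))
```

A return value of 0 from `max_tolerated` means that even a single collaborating jondo pushes the degree below the threshold.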

Reiter and Rubin define a degree of anonymity as $(1 - p_{sender})$, where $p_{sender}$ is the
probability assigned by the attacker to a particular user as being the sender.
This measure gives an idea of the degree of anonymity provided by the system
for a particular user, and it is complementary to the degree proposed in this
paper. It is interesting to compare the results obtained by Reiter and Rubin
with the ones obtained in this paper (for the same attack model): they
consider that the worst acceptable case is the situation where one of the jondos
is seen by the attacker as the sender with probability 1/2. Therefore, they come
to the conclusion that, for $p_f = 0.75$, the maximum number of collaborating
jondos the system can tolerate is C ≤ N/3 − 1. For the chosen examples we
obtain: for N = 5 users, C = 0, for N = 20 users, C ≤ 5, and for N = 100 users,
C ≤ 32.
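
The bound C ≤ N/3 − 1 can be cross-checked against the predecessor probability used earlier in this section. The sketch below is illustrative. It assumes the criterion is that the predecessor of the first collaborator is assigned probability at most 1/2, and the function names are hypothetical.

```python
def predecessor_probability(N, C, p_f):
    """Probability assigned to the predecessor of the first collaborating jondo."""
    return (N - p_f * (N - C - 1)) / N

def max_collaborators(N, p_f, limit=0.5):
    """Largest C for which the predecessor probability stays at or below `limit`.

    Returns None when the criterion already fails with no collaborators.
    """
    best = None
    for C in range(N):
        if predecessor_probability(N, C, p_f) > limit:
            break
        best = C
    return best

for N in (5, 20, 100):
    print(N, max_collaborators(N, 0.75), N / 3 - 1)
```

For all three crowd sizes this criterion tolerates at least as many collaborators as the requirement d ≥ 0.8 does at N = 20 and N = 100, while for N = 5 it is the stricter of the two.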

Degree of Anonymity from the Point of View of the Sender. We have calculated
the degree of anonymity of a user who sends a message that goes through a
corrupted jondo, but this only happens with probability C/N each time the
message is forwarded to another jondo. We have to take into account that the
first jondo always forwards the message to a randomly chosen jondo of the
crowd, and subsequent jondos forward with probability $p_f$ to another jondo,
independently from previous decisions. The probability $p_H$ of a message going
only through honest jondos is:

\[ p_H = \frac{N - C}{N} (1 - p_f) \sum_{i=0}^{\infty} \left( \frac{N - C}{N} p_f \right)^{i} = 1 - \frac{C}{N - p_f (N - C)} . \]

If a message does not go through any collaborating jondo, the attacker will
assign all honest senders the same probability, $p_i = 1/(N - C)$, and the degree
of anonymity will be d = 1 (the maximum degree is achieved because the
attacker cannot distinguish the sender from the rest of honest users). Some further
discussion about the implications of this fact can be found in Appendix A.
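
The closed form for the probability of an all-honest path can be compared with the geometric series it sums. The sketch below is an illustration with hypothetical function names. It assumes each hop lands on an honest jondo with probability (N − C)/N and that the series is truncated after `terms` summands.

```python
def prob_all_honest(N, C, p_f):
    """Closed form for the probability that a path contains only honest jondos."""
    return 1 - C / (N - p_f * (N - C))

def prob_all_honest_series(N, C, p_f, terms=500):
    """Same probability, summed term by term over the number of extra forwards."""
    honest = (N - C) / N
    series = sum((honest * p_f) ** i for i in range(terms))
    return honest * (1 - p_f) * series

print(prob_all_honest(20, 4, 0.75))
print(prob_all_honest_series(20, 4, 0.75))
```

With N = 20, C = 4 and p_f = 0.75, half of all paths avoid the collaborators entirely, and for those paths the degree is 1.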



4.3 Onion Routing 

Overview of the System. Onion Routing (Reed, Syverson and Goldschlag) is a solution for
application-independent anonymous connections. The network consists of a number of onion
routers. They have the functionality of ordinary routers, combined with mixing 
properties. Data is sent through a path of onion routers, which is determined by 
an onion. 

An onion is a layered encrypted data structure, that is sent to an onion 
router. It defines the route of an anonymous connection. It contains the next hop 
information, key seed material for generating the symmetric keys that will be 
used by the onion router during the actual routing of the data, and an embedded 
onion that is sent to the next onion router. 

The data is encrypted multiple times using the symmetric keys that were 
distributed to all the onion routers on the path. It is carried by small data 
cells containing the appropriate anonymous connection identifier. Each onion 
router removes/adds a layer of encryption (using the symmetric keys, generated 
from the key seed material in the onion) depending on the direction of the data 
(forwards/backwards).
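
The layered structure of an onion can be illustrated with a toy example. This
is a sketch under strong assumptions, not the actual Onion Routing format: the
"cipher" is an insecure XOR keystream built from SHA-256, it stands in for the
per-router encryption of each layer, and the router names, keys, and JSON layout
are all hypothetical.

```python
import hashlib
import json


def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (NOT secure): XOR data with SHA-256(key || offset)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def build_onion(route, router_keys, key_seeds):
    """Wrap layers from the last router back to the first.

    Each layer holds the next hop, key seed material for that router,
    and the embedded onion destined for the next router.
    """
    onion = b""
    next_hop = None  # the last router has no further onion router
    for router in reversed(route):
        layer = json.dumps({
            "next_hop": next_hop,
            "key_seed": key_seeds[router],
            "inner": onion.hex(),
        }).encode()
        onion = keystream_xor(router_keys[router], layer)
        next_hop = router
    return onion


def peel(onion, router_key):
    """One router removes its layer and learns next hop and key seed."""
    layer = json.loads(keystream_xor(router_key, onion))
    return layer["next_hop"], layer["key_seed"], bytes.fromhex(layer["inner"])


route = ["R1", "R2", "R3"]
router_keys = {r: f"long-term-key-{r}".encode() for r in route}
key_seeds = {r: f"seed-{r}" for r in route}

onion = build_onion(route, router_keys, key_seeds)
hops = []
for r in route:
    next_hop, seed, onion = peel(onion, router_keys[r])
    hops.append((r, next_hop, seed))
print(hops)
```

Each router sees only its own layer: the next hop, its key seed, and an opaque
inner onion that it passes on.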

Attack Model. Several attack models have been described by Reed, Syverson
and Goldschlag. In this example we consider an attacker who is able to
narrow down the set of possible paths. The attacker obtains, as a result of the
attack, a subset of the anonymity set that contains the possible senders. We
make no assumption about the attacker other than that he does not control any
user of the system. We abstract from the details of the attack; to illustrate the
example, it could be carried out as a brute-force attack, starting from
the recipient and following all the possible reverse paths to the senders.
Alternatively, the attacker may control some of the onion routers and thereby be
able to eliminate a group of users from the anonymity set.

Fig. 5. Example of Onion Routing



Degree of Anonymity. Figure 5 gives an example of an Onion Routing system.
There are seven users in total in this system. We assume that the attacker
managed to exclude users 6 and 7 from the set of possible senders.

Generally, let N be the size of the anonymity set; the maximum entropy for
N users is:

$$H_M = \log_2(N).$$

The attacker is able to obtain a subset of the anonymity set that contains the
possible senders. The size of the subset is S (1 ≤ S ≤ N). We assume that
the attacker cannot assign different probabilities to the users that belong to this
subset:

$$p_i = \frac{1}{S}, \quad 1 \le i \le S; \qquad p_i = 0, \quad S+1 \le i \le N.$$

Therefore, the entropy after the attack has taken place and the degree of
anonymity are:

$$H(X) = \log_2(S),$$

$$d = \frac{H(X)}{H_M} = \frac{\log_2(S)}{\log_2(N)}.$$



Figure 6 shows the degree of anonymity with respect to S for N = 5, N = 20
and N = 100. Obviously, d increases with S, i.e., when the number of users that
the attacker is able to exclude from the anonymity set decreases. In order to
obtain d > 0.8: for N = 5 users, we need S > 3; for N = 20 users, we need
S > 12; and for N = 100 users, we need S > 40.
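
For this attack model the degree of anonymity depends only on S and N, so it is
easy to compute. The following sketch (the function name is illustrative)
evaluates d = log2(S)/log2(N) for the seven-user example, where the attacker
has excluded two users and S = 5.

```python
import math


def degree_of_anonymity(s: int, n: int) -> float:
    """d = log2(S) / log2(N) for a uniform distribution over S of N users."""
    if n < 2 or not 1 <= s <= n:
        raise ValueError("need N >= 2 and 1 <= S <= N")
    return math.log2(s) / math.log2(n)


# Example from the text: seven users, two of them excluded by the attacker.
d_example = degree_of_anonymity(5, 7)
print(d_example)
```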

When comparing N - S to the number of collaborating jondos C in the
Crowds system, it seems that Onion Routing is much more tolerant of
‘failing’ users/jondos than Crowds. This is because the remaining ‘honest’
users/jondos have equal probability (for this attack model) in the Onion Routing
system, while in Crowds there is one jondo that has a higher probability than
the others.

Fig. 6. Degree of anonymity for Onion Routing



5 Conclusions and Open Problems 

Several solutions for anonymous communication have been proposed and
implemented in the past. However, the problem of how to measure the actual
anonymity they provide has not yet been studied thoroughly. We proposed a
general measurement model to quantify the degree of anonymity provided by a
system in particular attack circumstances. We applied our model to some
existing solutions for anonymous communication. We suggested an intuitive value
for the minimum degree of anonymity for a system to provide adequate anonymity.
The model proved to be very useful for evaluating a system and for comparing
different systems.

In the examples we have chosen, we calculate the degree for a particular
message, and we do not take into account the behavior of the system over time.
However, the attacker may gain useful information by observing the system for
a longer time, and this fact is reflected in the distribution of probabilities. We
could apply the model taking into account these changes in the probabilities,
and we would obtain information on the evolution of the degree of anonymity
over time.

There are still some open problems. Our model is based on the probabilities
an attacker assigns to users; finding this probability distribution in real
situations is, however, not always easy.

It would also be interesting to take into account the a priori information the
attacker may have, and to use the model to see the amount of information he has
gained with the attack.

The paper only focused on sender anonymity; recipient anonymity can be 
treated analogously; unlinkability between any sender and any recipient depends 
on the probability of finding a match. 

Finally, the usefulness of our model should be tested more intensively; for
example, it would be interesting to measure the effect of dummy traffic in the
more advanced anonymous communication solutions, in order to find the right
balance between performance and privacy.

A Extension of the Model

In some systems we may get different distributions with a certain probability.
For example, in Crowds, there are two cases: the message goes through a
corrupted jondo with probability p_C, and it goes only through honest jondos with
probability p_H, where:

$$p_C = \frac{C}{N - p_f(N-C)}, \qquad p_H = 1 - p_C = 1 - \frac{C}{N - p_f(N-C)}.$$

If we want to calculate the degree of anonymity offered by the system taking
into account all possibilities, we may combine the obtained degrees as follows:

$$d = \sum_{j=1}^{K} p_j\, d_j,$$

where d_j is the degree obtained under particular circumstances and p_j the
probability of occurrence of such circumstances. K is the number of different
possibilities.

The degree of anonymity becomes in this case a composite of the degrees 
obtained for the different cases. 
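
The combination can be sketched in a few lines. The code below assumes the
Crowds case probabilities p_C = C/(N - pf(N - C)) and p_H = 1 - p_C, and
computes the probability-weighted sum of the per-case degrees. The value used
for the corrupted-jondo case is a hypothetical placeholder, not a result from
the paper; the honest case uses d = 1 as stated earlier.

```python
def composite_degree(cases):
    """Weighted sum d = sum_j p_j * d_j over (probability, degree) pairs."""
    total_p = sum(p for p, _ in cases)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("case probabilities must sum to 1")
    return sum(p * d for p, d in cases)


def crowds_case_probabilities(n, c, pf):
    """Return (p_C, p_H) with p_C = C / (N - pf*(N-C)) and p_H = 1 - p_C."""
    p_c = c / (n - pf * (n - c))
    return p_c, 1.0 - p_c


p_c, p_h = crowds_case_probabilities(n=20, c=5, pf=0.75)
d_corrupted = 0.8  # hypothetical placeholder, NOT a value from the paper
d_honest = 1.0     # honest senders are indistinguishable in this case
d = composite_degree([(p_c, d_corrupted), (p_h, d_honest)])
print(p_c, p_h, d)
```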

B Alternative Solution 

It may be the case that, for a particular system, a requirement on the minimum
acceptable degree of anonymity is formulated as follows: users should have at
least a degree of anonymity equivalent to a system with M users and perfect
indistinguishability.

In this case, we could compare the actual entropy of the system against the
required one. We should then compare the obtained entropy H(X) with log2(M),
instead of normalizing by the best the system can do with the current number of
users. If H(X) is bigger, then the system is above the minimum; if it is smaller,
we may want to use some extra protection in the system, such as dummy traffic.
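
A minimal sketch of such a check, assuming the attacker's probability
distribution over the users is available as a list. The function names are
illustrative, and treating equality as meeting the requirement is an assumption
based on the "at least" wording.

```python
import math


def entropy(probabilities):
    """Shannon entropy H(X) in bits of the attacker's distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def meets_requirement(probabilities, m):
    """True if H(X) >= log2(M), i.e. as good as M indistinguishable users."""
    return entropy(probabilities) >= math.log2(m)


uniform = [0.25] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(meets_requirement(uniform, 4), meets_requirement(skewed, 4))
```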

This might be useful to see if the system is meeting the requirements or not,
and to raise an alarm in case the degree of anonymity is lower than the one
defined as the minimum.

Abstract. Enterprises collect a large amount of personal data about their
customers. Even though enterprises promise privacy to their customers using privacy
statements or P3P, there is no methodology to enforce these promises throughout 
and across multiple enterprises. This article describes the Platform for Enterprise 
Privacy Practices (E-P3P), which defines technology for privacy-enabled manage- 
ment and exchange of customer data. Its comprehensive privacy-specific access 
control language expresses restrictions on the access to personal data, possibly 
shared between multiple enterprises. E-P3P separates the enterprise-specific de- 
ployment policy from the privacy policy that covers the complete life cycle of 
collected data. E-P3P introduces a viable separation of duty between the three 
“administrators” of a privacy system: The privacy officer designs and deploys 
privacy policies, the security officer designs access control policies, and the cus- 
tomers can give consent while selecting opt-in and opt-out choices. 



1 Introduction 

Consumer privacy is a growing concern in the marketplace. Whereas privacy concerns 
are most prominent in the context of e-commerce, they are increasing for traditional 
transactions as well. Some enterprises are aware of these problems and of the market 
share they might lose unless they implement proper privacy practices. As a consequence,
enterprises publish privacy statements that promise fair information practices. Written in
natural language or formalized using P3P, they merely constitute privacy promises
and are not necessarily backed up by technological means.

In this article, we describe the Platform for Enterprise Privacy Practices (E-P3P). E- 
P3P defines technology that enables an enterprise to enforce the privacy promises made 
to its customers. It solves some of the most prominent privacy issues of enterprises that 
collect data from their customers: 

- Enterprises store a variety of personally identifiable information (PII or personal 
data for short). Larger enterprises may not even know what types of personal data 
are collected and where it is stored. 

- Enterprises may not know the consent a customer has given nor the legal regulations 
that apply to a specific customer record. 

- Enterprises exchange customer data. Enterprises that process or store data collected
by another enterprise are unable to enforce privacy consistently on behalf of the
collecting enterprise.

Whenever an enterprise collects, stores, or processes personal information, E-P3P can
be used to ensure that the data flows and usage practices of an enterprise comply with
the privacy statement of that enterprise*. E-P3P can be used in the following areas:

- Formalized Privacy Policies: E-P3P enables an enterprise to formalize a privacy 
policy into a machine-readable language that can be enforced automatically. The 
natural text version is inspected by the customers whereas the machine-readable 
version is used for enforcement within the enterprise. 

- Formalized Policy Options: An E-P3P privacy policy can identify opt-in as well as 
opt-out choices or options that depend on the collected data (e.g., whether the given 
data pertains to a child). These options enable a company to use a limited number 
of policies while still providing freedom of choice to its customers. 

- Customer Consent Management: A privacy policy can be regarded as a contract 
between the individual customer and an enterprise. As a consequence, a customer 
needs to authorize the applicable policy as well as any applicable opt-in and opt-out 
choices that the policy offers. This requires recording of consent on a per-customer 
basis. 

- Policy Enforcement: Given collected personal data and its policy, the policy needs to 
be enforced. Policy enforcement covers several cooperating enterprises if personal 
data is exchanged among them. The core technology is a scheme for privacy-enabling 
access control that allows only actions that are authorized by the applicable privacy 
policy. Besides granting or denying access, privacy obligations have to be enforced 
as well (such as “we delete collected data if consent is not given within 15 days”). 

- Compliance Audit: The handling of personal data should comply with the privacy 
policy in an auditable way. This enables a privacy officer to verify that the data was 
handled properly. 

Note that this is only the technical core of privacy-enabled customer data management. 

Another important building block is to provide additional customer privacy services. 
Customers should be enabled to inspect and update the data and usage logs stored about 
them. In addition, an enterprise may offer the option to delete the personal data. Ideally, 
customers should retain maximum control over their data. Once a privacy-management 
scheme has been implemented, it needs to be audited by external parties that are trusted 
by the customers. Together with resulting privacy seals, this can increase the trust of the 
consumers. Customer privacy requires secure systems. As a consequence, enterprises 
must implement continuous business processes to keep their systems secure. Owing to 
space restrictions we will not elaborate on these services in this article. 

The rest of this paper is structured as follows: In Section 2 we describe existing
work related to privacy-enabled data management. In Section 3 we describe the E-P3P
scheme and architecture for privacy-enabled customer data management. In Section 4
we take a closer look at the language for formalizing privacy policies as well as at the
logic for evaluating privacy policies. We conclude in the final section. A privacy policy
for a hypothetical online bookstore called Borderless Books is given in the appendix.

* Note that our scheme only protects against systematic privacy violations within the system. For
example, it cannot prevent misuse by an employee with legitimate access.

2 Related Work

The Platform for Privacy Preferences (P3P) standard of W3C enables a Web site to
declare what kind of data is collected and how this data will be used. A P3P policy may 
contain the purposes, the recipients, the retention policy, and a textual explanation of 
why this data is needed. P3P defines standardized categories for each kind of information 
included in a policy. Compared to P3P, our model defines the privacy practices that are 
implemented within an enterprise. As this depends on internal structures of the com- 
pany, it results in more detailed policies that can be enforced and audited automatically. 
Note that our policies can use P3P-compatible terminology to easily check that the P3P 
promises correspond to the enterprise-internal policies. 

Current access control systems only check whether a user is allowed to perform
an action on an object. Fischer-Hübner augmented a task-based access control
model with the notion of purpose and consent. Data can only be accessed in a controlled
manner by executing a task. A user can access personal data if this access is necessary to
perform the current task and the user is authorized to execute this task. In addition, the
task’s purpose must correspond to the purposes for which the personal data was obtained
or there has to be consent by the data subjects. This work is the first complete model
of privacy we are aware of. However, the model considers neither context-dependent
access control nor obligations and is restricted to a single enterprise.

A language for use-based restrictions that allows one to state under which conditions 
specific data can be accessed has been developed by Bonatti et al. In their language,
a data user is characterized as the triple user, project, and purpose. Projects are named 
activities registered at the server, for which different users can be subscribed, and which 
may have one or more purposes. Conditions are used to define constraints that must be 
satisfied for the request to be granted. 

Mandatory and discretionary access controls do not handle environments in which
the originators of documents retain control over them after those documents have been
disseminated. McCollum et al. define Owner-Retained Access Control (ORAC),
which provides a stringent, label-based alternative to discretionary access control. This is
of interest for user communities where the original owners of data need to retain rights
to the data as it propagates through copying, merging, or being read by a subject that
may later write the data into other objects. The originator-controlled access control
(ORGCON) policy limits the authority of recipients of information to use or copy it.

The concept of provisional authorization shares similar objectives with privacy
obligations. Added to the access decision, provisions are a kind of annotation that specify 
necessary actions to be taken. Modeled as a sequence of secondary access requests, they 
are executed by the user and/or the system under the supervision of the access control 
system. 

Our concept of bundling data and policy is similar to the concept used in XACL.
An XACL document contains an access control policy for a particular XML document
as well as the document to which the access shall be restricted.

A data format for disclosing customer profile data between enterprises has been
defined by CPexchange. CPexchange uses P3P-like privacy statements to define the
policy accompanying the disclosed profile.

3 Platform for Enterprise Privacy Practices

The Platform for Enterprise Privacy Practices (E-P3P) is a scheme for privacy-enabled 
management of customer data. Its core is an authorization scheme that defines how 
collected data may be used. 

3.1 Application Model and Prerequisites 

In E-P3P, an enterprise runs legacy applications that use collected data. Each application 
can perform certain tasks. For example, a “customer relationship management system 
(CRM)” application may perform the tasks “create new customer record” or “update 
existing customer record”. 

Enterprise privacy policies reflect the authorized flow and usage of personal infor- 
mation within an enterprise. As a consequence, the flows and usages have to be identified 
in order to use the E-P3P system: 

- A business-process model for the collection and use of customer data defines the 
scope of the data management system. The business-process model identifies the 
players that use collected data, the data they use, and how and for what purposes they 
use the data. The business model is formalized as the declaration of data, players 
and operations of an E-P3P privacy policy. 

- A collection of informal privacy policies that govern the use of personal data in 
the business processes. They can be structured as bilateral privacy agreements that 
describe how data that is sent from one player to another may be used. Informal 
privacy policies are formalized as E-P3P privacy policy rules. 



3.2 Policies and Separation of Duties 

Privacy and security authorization in an enterprise involves at least four types of players. 
The data subjects are the players about whom personal data is collected. The most com- 
mon data subjects are the customers of an enterprise. Other data subjects are employees 
or customers of cooperating enterprises. The next players are the data users within an 
enterprise who use collected data by executing tasks of applications. The other two play- 
ers are the privacy officer (PO), who is responsible for privacy services, and the security 
officer (SO), who is responsible for security services. 

E-P3P introduces the following intermediate abstractions and the corresponding 
policies (see Figure 1) in order to separate the duties of these players:

1. Personal data is collected in forms. A form is a set of fields, and fields have a
type (e.g., string) as well as a PII type (e.g., “medical record”, “address data”, or 
“order data”). A form groups personal data and associates this data with its data 
subject. Examples of forms are “customer data”, “purchase history”, and “financial 
information”. A “customer data” form, for example, may group the fields “name”, 
“street”, and “town”. 

[Figure: surviving panel titles: Security Administration, Privacy Administration, Customer Consent]

Fig. 1. Separation of duties for privacy authorization.

2. The PO defines a privacy policy. A privacy policy describes what operations for
which purpose by which data user can be performed on each PII type. For example,
the “marketing department” may be allowed to “read” the PII type “contact data”
for purpose “e-mail marketing”. In addition, a privacy policy may define opt-in and
opt-out choices for the data subjects as well as certain privacy obligations such as
“delete my data after 30 days unless parental consent has been given”.

The privacy policy should be enterprise- and application-independent. Enterprise- 
internals such as the role structure should not be used in order to enable exchange 
of policy-protected data between cooperating enterprises. 

3. The PO and the SO define a deployment policy that maps legacy applications and 
their tasks onto the privacy-specific terminology used by the privacy policies. This
mapping is specific to each enterprise. For example, whereas one enterprise maps 
a CRM system performing “product notification” as well as a printer for mass- 
mailings onto the action “read” for purpose “marketing”, another enterprise, which 
uses a legacy application instead of the off-the-shelf CRM system, maps this legacy 
application onto “read” for “marketing”. 

4. The PO defines a collection catalog that identifies the sources where data is col- 
lected. For each source, the catalog defines the collected data, its PII types, and a 
default policy. This information is associated with an empty form for each particular 
collection point. 

5. The PO defines an obligation mapping that translates application-independent obli- 
gations of the privacy policy (such as “delete”) into specific implementations. For 
example, a delete may be translated into an “unsubscribe” of the mailing list. 

6. The SO defines an access control policy that defines the roles and users within an 
enterprise. In addition, it defines which users or roles can execute which tasks of 
which applications. 
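
The abstractions listed above can be pictured as simple data structures. The
following Python sketch is an illustration only: all class names, field names,
and example values (including the assignment of PII types to fields and the
role "clerk") are assumptions and not part of E-P3P itself.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Field:
    name: str
    value_type: str  # e.g. "string"
    pii_type: str    # e.g. "address data"


@dataclass
class Form:
    name: str
    data_subject: str
    fields: list = field(default_factory=list)


# Privacy policy (PO): (data user, operation, purpose, PII type) tuples.
privacy_policy = {
    ("marketing department", "read", "e-mail marketing", "contact data"),
}

# Deployment policies (PO and SO): enterprise-specific mappings of
# (application, task) onto the policy vocabulary (operation, purpose).
deployment_enterprise_a = {("CRM", "product notification"): ("read", "marketing")}
deployment_enterprise_b = {("legacy-app", "send newsletter"): ("read", "marketing")}

# Access control policy (SO): which role may run which task of which application.
access_control = {("clerk", "CRM", "product notification")}

customer = Form("customer data", "alice", [
    Field("name", "string", "contact data"),
    Field("street", "string", "address data"),
    Field("town", "string", "address data"),
])
print(sorted({f.pii_type for f in customer.fields}))
```

Both enterprises end up with the same privacy-level pair ("read", "marketing")
although their applications differ, which is what allows the privacy policy
itself to stay enterprise-independent.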

3.3 Collecting Personal Data, Opt-in and Opt-out Choices, and Consent
from a Data Subject 

The collection catalog identifies an empty form for each collection point. At a given
collection point, the data subject enters its data in the fields of the given form. The
filled-out form contains the fields and PII types of the entered data as well as a default
policy. The data subject may then select opt-in and opt-out choices defined by the policy.
By submitting the form, the data subject consents to the policy with respect to the selected
choices. The choices are added to the form and the content of the form is stored.

Note that enterprise privacy policies that model access down to the employee level are
usually too complex for end-users. As a consequence, it is advisable to present a
coarser-grained privacy policy to the customer (either as text or P3P) and to implement
an E-P3P policy for internal enforcement. For managing consent, a graphical user interface
is needed that enables the data subject to opt in to or opt out of certain choices of the policy.

3.4 Granting or Denying Access 

The form associating the collected data, the privacy policy and the selected options is 
used to decide whether an access shall be granted. Authorization is granted at two levels.
Whereas access control focuses on restricting the access of employees to enterprise 
applications, privacy control restricts the access of applications to collected data. 

Access Control: An employee acting as a data user with certain roles requests permission 
to perform a task of an application. The access control policy is used to verify that the 
data user with the given roles is in fact allowed to perform the requested task. If this is 
the case, the task is executed. This access control system is independent of the privacy 
authorization. 

Privacy Control: Once a running task of a corresponding application has requested 
access to certain fields of collected data, the privacy enforcement system retrieves the 
form and uses it to allow or deny the given request as follows: 

1. The request identifies the task of an application as well as the fields to be accessed.

2. The deployment policy maps the task onto a privacy-relevant operation and a pur- 
pose. 

3. The form identifies the PII types of the requested fields. 

4. The privacy policy and the data subject’s choices are used to decide whether the 
operation for this purpose is allowed on the given PII types. 

5. If the operation is denied, the access for the given task on the given fields is rejected. 

6. If the operation is allowed and the privacy policy specifies a privacy obligation, the
obligation mapping maps the obligation to a task of an application**.

7. If the operation is allowed, the task can be executed on the requested fields. 
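
The seven steps above can be sketched as a single function. This is a
simplified illustration under stated assumptions: policies are plain
dictionaries, purposes are matched exactly (no purpose hierarchy), the data
user is left out of the rule key, and all names and example values are
hypothetical.

```python
def privacy_control(request, deployment_policy, form, privacy_policy,
                    obligation_mapping):
    """Return (allowed, obligation_tasks) for one access request."""
    # Step 1: the request names the application task and the fields.
    task, fields = request["task"], request["fields"]
    # Step 2: map the task onto a privacy-relevant operation and purpose.
    operation, purpose = deployment_policy[task]
    # Step 3: the form identifies the PII types of the requested fields.
    pii_types = {form["pii_types"][f] for f in fields}
    # Step 4: every PII type needs a rule whose condition holds.
    obligations = []
    for pii in pii_types:
        rule = privacy_policy.get((operation, purpose, pii))
        if rule is None or not rule["condition"](form["choices"]):
            return False, []  # Step 5: reject the access.
        obligations.extend(rule["obligations"])
    # Step 6: translate obligations into tasks of applications.
    tasks = [obligation_mapping[o] for o in obligations]
    # Step 7: the task may be executed on the requested fields.
    return True, tasks


form = {
    "pii_types": {"email": "contact data", "diagnosis": "medical record"},
    "choices": {"OptInToEmailing": True},
}
deployment_policy = {("CRM", "product notification"): ("read", "marketing")}
privacy_policy = {
    ("read", "marketing", "contact data"): {
        "condition": lambda choices: choices.get("OptInToEmailing", False),
        "obligations": ["notify"],
    },
}
obligation_mapping = {"notify": ("mailer", "send notice to data subject")}

task = ("CRM", "product notification")
granted = privacy_control({"task": task, "fields": ["email"]},
                          deployment_policy, form, privacy_policy,
                          obligation_mapping)
denied = privacy_control({"task": task, "fields": ["diagnosis"]},
                         deployment_policy, form, privacy_policy,
                         obligation_mapping)
print(granted, denied)
```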

3.5 Sticky Policy Paradigm 

An important aspect of E-P3P is the management of the data subject’s consent on a 
per-person and a per-record basis. This is done by the sticky policy paradigm: When 
submitting data to an enterprise, the user consents to the applicable policy and to the 
selected opt-in and opt-out choices. The form then associates the opt-in and opt-out 
choices as well as the consented policy with the collected data. This holds even if the 
data is disclosed to another enterprise. 

** Applications are responsible for managing their data. As a consequence, they are required to
implement tasks that correspond to obligations in the privacy policy. 



[Figure: architecture diagram; surviving labels: Users/Groups, Access Control System]

Fig. 2. Architecture for privacy enforcement.

Note that policy management on a per-user basis is useful if consent and different 
sources are issues to be considered. Examples are managing data of different policy 
versions (e.g., due to different collection times), different user roles (e.g., paying users 
vs. users funded by advertising), or users from different jurisdictions (e.g., Europe and 
US). 

3.6 Systems for Enterprise Privacy Enforcement 

The authorization procedure described in Section 3.4 is structured into the privacy
enforcement components depicted in Figure 2. The components interact as follows to
decide whether a task executed by a legacy application is allowed to access a protected
resource:

1. A legacy application tries to execute a task on a protected resource.

2. A resource-specific resource monitor shields the resource and captures the request 
of a certain task for certain fields. For each task, it asks the privacy management
system for authorization. 

3. The resource-independent privacy management system obtains an authorization
query identifying the fields to be accessed by a certain task of a certain application.
It performs steps 1 to 3 of the authorization procedure in Section 3.4 to deploy
the authorization query.

4. The policy evaluation engine performs step 4 of the authorization procedure. The
policy evaluation engine needs context and data to evaluate conditions. The resource 
monitor abstracts from resource and storage details by using a dynamic attribute 
service that provides values for data and context variables on request. 

The policy evaluation engine returns the decision as well as any resulting obligations
to the privacy management system.

5. The privacy management system returns the decision of the policy evaluation engine
to the resource monitor. If obligations were returned, the obligations are mapped
onto tasks (step 6) and sent to the obligations engine.

6. The resource monitor performs the tasks if it has been authorized. If not, the task is 
denied. In addition, the resource monitor sends log data to the audit monitor. 

7. The resource-independent obligations engine stores all pending obligations. It eval- 
uates the associated conditions based on values obtained from the dynamic attribute 
service. When a cancel-condition becomes valid, the obligation is removed. When 
a start-condition becomes valid, the obligation is sent to the resource monitor for 
execution. 
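
The behavior of the obligations engine in step 7 can be sketched as follows.
The class and attribute names are hypothetical, the dynamic attribute service
and the resource monitor are modeled as plain callables, and two details are
assumptions: the cancel-condition is checked before the start-condition, and
an obligation is dropped from the pending list once it has been sent for
execution.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Obligation:
    task: str
    start: Callable[[dict], bool]   # start-condition over attribute values
    cancel: Callable[[dict], bool]  # cancel-condition over attribute values


class ObligationsEngine:
    def __init__(self, attribute_service, resource_monitor):
        self.attribute_service = attribute_service  # callable returning a dict
        self.resource_monitor = resource_monitor    # callable executing a task
        self.pending = []

    def add(self, obligation):
        self.pending.append(obligation)

    def evaluate(self):
        """Check every pending obligation once against current values."""
        values = self.attribute_service()
        still_pending = []
        for ob in self.pending:
            if ob.cancel(values):
                continue                        # cancelled: remove it
            if ob.start(values):
                self.resource_monitor(ob.task)  # due: send for execution
                continue
            still_pending.append(ob)
        self.pending = still_pending


# "Delete after 30 days unless parental consent has been given."
state = {"days_since_collection": 10, "parental_consent": False}
executed = []
engine = ObligationsEngine(lambda: dict(state), executed.append)
engine.add(Obligation(
    task="delete form",
    start=lambda v: v["days_since_collection"] >= 30,
    cancel=lambda v: v["parental_consent"],
))
engine.evaluate()  # day 10: the obligation stays pending
state["days_since_collection"] = 31
engine.evaluate()  # day 31 without consent: deletion is triggered
print(executed, engine.pending)
```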

4 E-P3P Privacy Policy Language 

In a typical privacy policy, there are such statements as “We will use your address for 
e-mail marketing if you are not a minor.” Usually, “we” denotes the data user and “you” 
denotes the data subject. We present a language to formalize such privacy policies. 

4.1 Structure of a Privacy Policy 

A privacy policy contains three elements. The first element is a header that contains
information describing the policy such as a name, an author, and a version. The second
element is the declaration that declares the identifiers, such as PII types, operations,
and purposes. The third element is the set of authorization rules. The authorization rules
can express what operations for which purpose by which data user can be performed on a
given PII type. An example of a rule is that a “nurse” can “read” the PII type “medical
record” for “care-taking” purposes. Rules can contain conditions that evaluate data in a
form or external context variables. If an authorization rule contains a condition, the rule
only applies if the condition evaluates to true. This includes opt-in and opt-out choices
that are stored in the fields of a form. Examples include expressions such as “Age>18”,
“OptInToEmailing=True”, or “8am<currentTime<5pm”. Authorization rules
can contain obligations. Obligations are required consequences of performing the
authorized operation. An obligation, for example, can define that after storing data, it must
be deleted after 30 days unless parental consent is obtained.
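
Conditions of this kind can be pictured as predicates over field and context
variables. The sketch below is an assumption-laden illustration, not the E-P3P
condition syntax: conditions are Python callables, variables are looked up in a
dictionary keyed by names such as "field.Age" and "context.currentTime", and a
rule without conditions always applies.

```python
from datetime import time

# Hypothetical encodings of the three example conditions from the text.
conditions = {
    "adult": lambda v: v["field.Age"] > 18,
    "opted_in": lambda v: v["field.OptInToEmailing"] is True,
    "office_hours": lambda v: time(8) < v["context.currentTime"] < time(17),
}


def rule_applies(condition_names, variables):
    """A rule applies only if all of its conditions evaluate to true."""
    return all(conditions[name](variables) for name in condition_names)


variables = {
    "field.Age": 34,
    "field.OptInToEmailing": True,
    "context.currentTime": time(9, 30),
}
applies = rule_applies(["adult", "opted_in", "office_hours"], variables)
minor = dict(variables, **{"field.Age": 16})
applies_to_minor = rule_applies(["adult", "opted_in"], minor)
print(applies, applies_to_minor)
```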

4.2 Declarations of a Policy 

The first part of a privacy policy declares the data model as well as the identifiers used 
in the authorization rules. The declared obligations and context variables have to be 
supported by an implementation in order to be able to interpret a policy. This data-model 
can be specific to a single enterprise. If enterprises exchange data, they are required to 
agree on a common terminology before being able to exchange policy-protected data. 

Mandatory Fields and Context Attributes. Mandatory fields must be declared for each
PII type if field names are used as variables in conditions. The declaration comprises
the field name and its type. For example, the PII type “customer data” may mandate the
fields “name”, “street”, and “town” of type “string”. In addition, a declaration can define
a list of opaque external context variables that can be used for constructing conditions.
Table 1 lists the variables that can be used for constructing conditions.

Table 1. Fields and mandatory context attributes that can be used as variables in conditions.

Condition Variable        Instantiated with
------------------------  ------------------------------------------------------
context.currentTime       the current time
context.executor          the data user performing the operation
context.operation         the identifier of the requested operation
context.collectionTime    the time at which this particular form has been
                          collected from the data subject
field.[fieldname]         a variable for each mandatory field
argument.[argumentname]   a variable instantiated with each argument of a
                          declared operation that is currently executed

Data Users. The data user declaration defines the data users that are covered by the 
policy. Data users can either be distinct enterprises that use the data or different depart- 
ments within an enterprise. Data users are not structured in a hierarchical manner. Note 
that the access-control notion of roles is different from our notion of data users. Whereas 
roles model enterprise-internals, data users represent entities that are different from a 
data subject’s point of view. 

Purpose Hierarchy. Purposes are strings that identify the purposes for which an 
operation is executed. A privacy policy, for example, may authorize data disclo- 
sure for “research purposes” but not for “marketing purposes”. Purposes are ordered 
in a hierarchical manner. We use a directory-like notation for purposes (e.g., mar- 
keting and its sub-purposes marketing/email-mailings and marketing/ 
postal-mailing). If an operation is allowed for a given purpose, we assume that it 
is allowed for all sub-purposes. 
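
The sub-purpose rule can be illustrated with a minimal Python sketch. The function name `is_subpurpose` and the component-wise comparison of the directory-like paths are assumptions made for illustration; the paper does not prescribe an implementation.

```python
def is_subpurpose(purpose: str, ancestor: str) -> bool:
    """Return True if `purpose` equals `ancestor` or lies below it in the hierarchy.

    Purposes are compared component by component, so "marketing-research"
    is not treated as a sub-purpose of "marketing".
    """
    purpose_parts = purpose.strip("/").split("/")
    ancestor_parts = ancestor.strip("/").split("/")
    return purpose_parts[:len(ancestor_parts)] == ancestor_parts


# An operation allowed for "marketing" is assumed to be allowed for its sub-purposes.
print(is_subpurpose("marketing/email-mailings", "marketing"))   # True
print(is_subpurpose("marketing-research", "marketing"))         # False
```

Comparing whole path components rather than raw string prefixes avoids accidental matches between unrelated purposes that merely share leading characters.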

Operations on the Form. An enterprise uses the collected personal data by performing 
operations on it. Operations in our sense represent the privacy viewpoint regarding the 
use of data. A policy may use such terms as “use”, “read”, “disclose”, or “anonymize”, 
even if “use” and “read” are both implemented by SQL statements. Each operation 
description contains a list of argument descriptions. Each argument description contains 
a unique name identifying this argument. Prefixed with “argument.”, the argument 
name can be used as a variable for constructing conditions that evaluate this argument. 
Note that operations are different from purposes: A policy may authorize the reading 
of certain data for billing purposes but may not allow the same data to be disclosed for 
billing purposes to another enterprise. 

Obligated Operations. The set of obligated operations declares the operations that can 
be used in defining obligations of an authorization rule. The most important operations 
are deleting forms, getting consent, and notifying the data subject. 

Fig. 3. Rules authorize purpose and operation under a condition. 



4.3 Authorization Rules 

An E-P3P policy contains a set of authorization rules for each PII type. Each authorization 
rule (see Figure 3) states an operation that can be performed by a data user for a given 
purpose. Unless superseded by a more specific rule, each rule also holds for all sub- 
purposes of the declared purpose. A simple rule, for example, can state that an enterprise 
is authorized to read the data for statistics. 

In addition to the basic authorization of (purpose, data user, operation ) on a PII type, 
a rule may specify a condition that must hold. If no condition is included, the rule always 
holds. We distinguish two types of rules depending on their outcome: 

- Authorize: The operation is authorized. 

- Authorize and Obligate: The operation is authorized. Executing the operation results 
in the list of obligated tasks contained therein. 

An Authorize rule authorizes the triple (operation/purpose/data user) on a PII type if the 
condition is satisfied. An Authorize and Obligate rule does the same but, in addition, it 
obligates a certain operation. In other words, if the authorized operation is performed, 
the obligation must be enforced as well. An example of such a rule is “the enterprise 
may store my data for handling orders” with the obligation that “data be deleted within 
30 days unless parental consent is obtained”. 

Conditions on Data and Context. Conditions define when an authorization rule can 
be applied. The standard cases for privacy policies are covered in the authorization 
triples (operation/purpose/data user) on a PII type. Policies that require more 
complicated authorizations are augmented by conditions that evaluate the data and 
context listed in Table 1. Variables and constants are evaluated using operators that 
depend on their type. The resulting Boolean expressions can then be further combined 
using Boolean operators. Examples of conditions are “(thisForm.age > 18) OR 
(thisForm.consent = True)” and “context.executor = field.POP”. 

As conditions can evaluate fields containing choices, an explicit mechanism for opt-in 
and opt-out choices is not necessary. For example, to enable an opt-in choice for 
e-mail direct marketing, a policy can declare a Boolean field “yes-to-emailing”. The 
corresponding rule authorizes an operation “send mail” for purpose “direct marketing” 
only if this data field is set to true. 

Descriptions of Obligated Tasks. An “authorize and obligate” rule specifies a list of 
obligated tasks. Each obligated task contains a list of operation descriptions (operation 
name and arguments) together with a start- and cancel-condition. The intuition is 
that the system queues obligations that result from performing an authorized operation. 
The start-condition then defines a precondition. If this condition is satisfied, the 
obligation must be executed. The cancel-condition defines a postcondition. If this 
condition is satisfied, the obligation is no longer necessary and may be deleted from the 
queue. Obligation conditions are expressed in the language for expressing authorization 
conditions with the extension that obligations can use the prefix for variables with 
deferred instantiation. Those variables are not instantiated during authorization but every 
subsequent time the obligation conditions are evaluated. When the obligation conditions 
are evaluated, the non-deferred variables have already been instantiated and are now used 
like constants. The variable “today” is instantiated when the authorization condition is 
evaluated, whereas the variable “''today” is instantiated when a start or cancel condition is 
evaluated. As a consequence, an obligation condition “''today > today + 1day” 
becomes true one day after the authorization rule has been applied. 

An example of such a complex obligated task is the obligation that a form must 
be deleted after 30 days if parental consent is missing. For this task, the operation 
is “delete”, and the start condition is “''today > today + 30d”, whereas the 
cancel condition is “''ParentConsent=True”. 
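
The difference between variables bound at authorization time and variables with deferred instantiation can be sketched as follows. This is an illustrative Python model, not the E-P3P syntax: the class `Obligation`, the function `status`, and the choice to test the cancel-condition before the start-condition are assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable, Dict


@dataclass
class Obligation:
    operation: str
    bound: Dict[str, object]                 # values fixed when the rule was applied
    start: Callable[[Dict, Dict], bool]      # start-condition(bound, deferred)
    cancel: Callable[[Dict, Dict], bool]     # cancel-condition(bound, deferred)


def status(obligation: Obligation, deferred: Dict[str, object]) -> str:
    """Evaluate a queued obligation against freshly instantiated deferred variables."""
    if obligation.cancel(obligation.bound, deferred):
        return "cancelled"   # may be removed from the queue
    if obligation.start(obligation.bound, deferred):
        return "due"         # must be executed
    return "pending"


# "Delete after 30 days unless parental consent is obtained."
delete_form = Obligation(
    operation="delete",
    bound={"today": date(2002, 4, 14)},      # hypothetical authorization date
    start=lambda b, d: d["today"] > b["today"] + timedelta(days=30),
    cancel=lambda b, d: d["ParentConsent"] is True,
)

print(status(delete_form, {"today": date(2002, 4, 20), "ParentConsent": False}))
print(status(delete_form, {"today": date(2002, 5, 20), "ParentConsent": False}))
print(status(delete_form, {"today": date(2002, 5, 20), "ParentConsent": True}))
```

The `bound` dictionary plays the role of the non-deferred variables, which behave like constants once the authorization has been evaluated, while `deferred` is re-read on every evaluation of the obligation conditions.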

4.4 Policy Evaluation Logic 

We now describe how privacy policies are evaluated. This algorithm refines the policy 
evaluation step of Section 3 for the refined privacy policies. The algorithm answers 
a request whether a data user is authorized to perform a certain operation for a given 
purpose and returns any resulting obligations. The algorithm assumes that for a given 
purpose, data user, PII type, and operation there is at most one rule with obligations and 
a condition that evaluates to true for any given data and context. For a given policy, the 
decision can depend on the following external inputs: 

1. The data user, the requested operation, its purpose, and the PII type on which the 
operation shall be performed. 

2. The data contained in the data fields that are declared in the policy. 

3. The context information that has been declared in the policy. 

The first item is a mandatory input for any authorization request. The latter two are 
additional information that is retrieved on request using the dynamic attribute service. 
The policy evaluation engine returns its decision and resulting obligations. 

Authorizing Access. The authorization request is evaluated as follows: 

1. The engine retrieves all rules that apply to the given tuple (PII type, data user, 
operation). 

2. The engine retrieves the set of most specific rules for the given purpose. This is a 
longest-matching prefix search on the purposes of the applicable rules. The result is 
a reduced set of applicable rules. 

3. The engine evaluates the conditions of each remaining applicable rule. To evaluate 
the conditions, the authorization engine instantiates the variables contained in the 
condition using the context attributes that are retrieved via the dynamic attribute 
service. All rules with conditions that evaluate to false are removed from the set of 
applicable rules. 

4. If there is more than one remaining rule with obligation, the request is denied due 
to inconsistencies in the policy. 

If the set of applicable rules is empty (i.e., if the conditions were not satisfied), the request 
is denied. If there is only one remaining rule with an obligation, the request is allowed 
and the obligation is returned. If none of the remaining rules contains any obligation 
(i.e., if multiple rules authorize the request but none imposes obligations), the request is 
granted. 
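
The four evaluation steps and the final decision can be summarized in a short Python sketch. The names `Rule`, `prefix_length`, and `authorize`, the representation of conditions as callables over a context dictionary, and the example rules are assumptions chosen for illustration; they are not part of E-P3P.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Rule:
    pii_type: str
    data_user: str
    operation: str
    purpose: str
    condition: Optional[Callable[[Dict], bool]] = None   # None: the rule always holds
    obligations: List[str] = field(default_factory=list)


def prefix_length(rule_purpose: str, requested: str) -> int:
    """Number of path components of rule_purpose if it prefixes requested, else -1."""
    rule_parts = rule_purpose.split("/")
    requested_parts = requested.split("/")
    if requested_parts[:len(rule_parts)] == rule_parts:
        return len(rule_parts)
    return -1


def authorize(rules: List[Rule], pii_type: str, data_user: str, operation: str,
              purpose: str, context: Dict) -> Tuple[bool, List[str]]:
    # Step 1: rules for the tuple (PII type, data user, operation).
    applicable = [r for r in rules
                  if (r.pii_type, r.data_user, r.operation) == (pii_type, data_user, operation)]
    # Step 2: keep only the most specific rules (longest matching purpose prefix).
    lengths = [prefix_length(r.purpose, purpose) for r in applicable]
    longest = max(lengths, default=-1)
    if longest < 0:
        return False, []
    applicable = [r for r, n in zip(applicable, lengths) if n == longest]
    # Step 3: drop rules whose condition evaluates to false.
    applicable = [r for r in applicable if r.condition is None or r.condition(context)]
    # Step 4: more than one remaining rule with obligations is a policy inconsistency.
    obligating = [r for r in applicable if r.obligations]
    if len(obligating) > 1 or not applicable:
        return False, []
    return True, (obligating[0].obligations if obligating else [])


rules = [
    Rule("CP", "Borderless", "read", "statistics"),
    Rule("CP", "Borderless", "send mail", "marketing",
         condition=lambda ctx: ctx.get("field.yes-to-emailing", False)),
    Rule("CP", "Borderless", "store", "orders",
         obligations=["delete within 30 days unless parental consent is obtained"]),
]
print(authorize(rules, "CP", "Borderless", "send mail", "marketing/email-mailings",
                {"field.yes-to-emailing": True}))
```

In this sketch the context dictionary stands in for the values that the dynamic attribute service would supply on request.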

5 Conclusion 

We have described the first scheme for enterprise privacy management. The enterprise- 
independent privacy policies are comprehensive and more expressive than existing pro- 
posals for enterprise-internal privacy policies. The deployment scheme enables enforce- 
ment of a common privacy policy for a variety of legacy systems. 

The viable separation of duties between the privacy officer and the security ad- 
ministrator enables secure and efficient management in practice. The intuitive consent- 
management paradigm enables customers to retain greater control over their personal 
data. 

Our methodology protects personal data within an enterprise with trusted systems 
and administrators against misuse or unauthorized disclosure. It cannot protect data if the 
systems or administrator are not trusted. Therefore, it merely augments a privacy-aware 
design of enterprise services that minimizes the data collected. In the desirable (but 
unlikely) scenario where an enterprise can offer its services without collecting personal 
data, our privacy management methodology would be rendered obsolete. 

To correctly specify privacy rights and obligations that are being promised by privacy 
statements and mandated by a number of legislatures, the privacy officer must be able to 
reconcile easily what should be authorized with what is actually authorized. Therefore, 
we have developed a formal model for authorization management and access control in 
privacy-protecting systems [8]. 

A Example - Privacy at Borderless Books Inc. 

In this section, we outline the most important elements of a privacy policy for a hypo- 
thetical online bookstore called Borderless Books. 

A.1 Designing a Privacy Policy 

The main business of Borderless Books is to sell books. Customers compose an order us- 
ing a shopping basket. To process an order a customer enters his user name and password. 
This identifies a customer profile. The order is stored as part of the customer profile. 
Name and credit card number of the form are sent to a payment processor for 
authorization. Borderless collects four types of PII in its customer database: CustomerProfile 
(CP), PaymentDetails (PD), OrderDetails (OD), and enterprise-internal management 
data (MD). The entry points are a Web page for creating customer profiles as well as the 
shopping basket. Both entry points are governed by a single policy. 

Table 2. Fields of personal data collected by Borderless Books. 

Fieldname(s)                 Data Type   PII Type   Description
Name, Surname, Address,      String      CP         Fields containing collected PII.
PostalCode, Phone, Email,
ParentName, ParentEmail,
CardType, CardOwner,
CardNumber
Birthdate                    Date        CP         Birthdate for determining the age.
CardExpiry                   Date        PD         Credit card expiry data.
YesToMarketing               Boolean     CP         Opt-in choice for personalized marketing.
ParentConsent                Boolean     CP         Set to true if parent consent has been given.
ParentID                     String      CP         Identifier of the parent.
Password                     String      CP         Password for authentication of the data subject.
OrderHistory                 List        OD         The order history managed by the enterprise.
DataSource                   String      MD         The source where the form has been obtained.
UserName                     String      MD         The user name of the data subject.



A customer profile contains the fields UserName, Password, Name, Surname, Ad- 
dress, Phone, E-mail, and Birth date as well as the credit card information (CardType, 
CardNumber, Expiry Date, CardOwner). The former are of PII type CustomerProfile 
whereas the latter is PaymentDetails. The OrderHistory is of PII Type OrderDetails. If 
the data subject is a minor, he also has to provide the name and email of a parent. The 
customer can give permission to Borderless to use his PII for marketing, to send the PII 
to third-party marketers, or to use depersonalized data for statistics. 

Identifying Privacy Practices. The following list gives the purposes for which 
PII is used as well as the corresponding operations: 

Profile Management: A customer consents to the data being stored. If a customer is a 
minor and if no consent by the parents is given, the data is deleted within 30 days. 
Upon request of the customer, the data is deleted. 

The purpose profile management uses the following operations: The operation 
Store is used to store the data at the enterprise. The operation Update enables 
an application that acts on behalf of the data subject to update the collected data. 
The operation Delete deletes a form and all copies. Before deleting the form, the 
operation sends a delete request to all parties to whom the data has been disclosed. 
Processing Orders: When Borderless Books processes Joe’s order, it sends the credit 
card company an invoice for payment. Borderless Books is authorized to disclose 
payment data from the slip to the credit card company, but not to put the titles of the 
ordered books on the invoice. 

This purpose uses the following operations: The operation Read reads the data 
for local use by Borderless Books. The operation Write writes an enterprise data 
field. The parameters are a field name (“field”) and a new value (“value”). Operation 
SendDisclosure sends PII to another enterprise. The input parameters are an 
identifier of the party to whom the data shall be disclosed (the “disclosee” of type 
Role), the purpose (“purpose”), as well as a subset of the fields that shall be disclosed 
(“fields”). Operation StoreDisclosure stores a form that has been obtained 
from another enterprise. The input parameters are a form (“form”) and a purpose 
(“purpose”). 

Personalized Marketing: If the customer consented to personalized marketing, his 
data (including the order history) can be disclosed to third-party marketers. 

The data model is depicted in Table 2. It shows the collected fields as well as their PII 
type together with a brief description of their intended use. 

A.2 Formalizing the Privacy Policy 

The goal is to formalize the following policy: 

Borderless Books uses your data only for processing orders. 

The data is not disclosed to any other party except for processing payments. The 
payment processor is obliged to delete the data within 1 day. 

If you consented to personalized marketing, we disclose the data to Direct Marketing 
Inc., which is not allowed to disclose the data further. 

For minors, we will delete the data if no parental consent is given within 30 days. 

Table 3 contains the rules that formalize this policy. The necessary declarations can be 
derived from this table. 



Table 3. Rules defining Borderless Bookstore’s privacy policy. 

Update   Borderless   adding consent   field=ParentConsent AND initiator=ParentID
Delete   Any          none             initiator IN {ParentID, DataSubject, DataSource, DataUser}

Abstract. To offer personalized services on the web and on mobile 
devices, service providers want to have as much information about their 
users as possible. In the ideal case, the user controls how much of this 
information is revealed during a transaction. This is a tradeoff between 
privacy and personalization: if the disclosed profile is too complex, it 
may become a pseudonym for the user, making it possible to recognize 
the user at a later time and link different revealed profile parts into one 
comprehensive profile of the individual. This paper introduces a model 
for profiles and analyzes it with the methods of probability theory: how 
much information is revealed and what is the user’s probability of 
staying anonymous. The paper examines how likely it is that a provider 
can link different disclosed profiles and recommends algorithms to avoid 
a possible privacy compromise. 



1 Introduction 

People are more willing to try and use a service, if they feel it is relevant to 
them. They want more appealing content, matching their personal interests and 
preferences: personalization is starting to become the key to success. At the same 
time, as computers and the Internet become more and more widespread, people 
are starting to be more conscious about their privacy and want to minimize the 
chance of a “Big Brother” . 

In order to get personalized services matching individual preferences, the 
user must disclose some kind of profile to the provider of the service. But this is 
dangerous: by providing more information, he or she is less anonymous towards 
the service (we use nymity terminology as defined in the cited literature). If the user reveals 
too much during a transaction, then the provider may later use that information to 
recognize him or her. Section 2 examines this problem in more detail. 

This paper tries to find an answer to the question: how much of a profile can 
be revealed without compromising the anonymity of the user? The paper uses the 
methods of formal probability theory. Section 3 suggests a simple probabilistic 
model to represent profile information, and analyzes how effectively a service 
provider can discover links between disclosed profiles in this model and how the 
user should disclose parts of the profile to avoid this. Further sections discuss 
how this model could be extended to be more realistic, and what are the possible 
effects of these extensions. 

R. Dingledine and P. Syverson (Eds.): PET 2002, LNCS 2482, pp. 85-I^HI 2003. 

© Springer- Verlag Berlin Heidelberg 2003 

2 Dangers of Profile Linking 

It is a common understanding that anonymity is more likely to protect privacy 
than pseudonymity ( 0 , 0 , 0 ), hence anonymous transactions should be used 
whenever possible. 

2.1 Complex Profile as a Pseudonym 

Based upon existing practices and profile standards (for example jSl)? a profile 
containing user preferences may be quite complex. The profile of a user (let us 
name her Alice) could include her favorite music, movies, clothing, food, travel, 
electronics, etc. If the profile is detailed enough, the number of possible profiles 
will be so large that the probability of finding anybody who has a profile identical 
to Alice is practically zero. If Alice reveals her complete profile during every 
transaction, then a service provider can use the disclosed profile to recognize 
her during later transactions |0j (but not necessarily knowing her real identity), 
without Alice knowing this. 

Hence, in the remainder of the paper we assume that whenever Alice 
discloses her profile, she discloses only a part of it that can be used to offer 
personalized services but is insufficient to recognize her. 

2.2 How a Service Provider Can Link Disclosed Profiles 

Consider that Alice has been using a specific service, say she visits a music store. 
Suppose she has already revealed some parts of her profile towards the store. 
If the shop could somehow link some of these subprofiles and mark them as 
belonging to the same person, it would immediately get a more comprehensive 
profile of Alice. This would certainly increase the probability of the provider 
being able to link this profile to further subprofiles of Alice. Once this happens, 
Alice is in trouble, because the risk that her further subprofiles can be linked is 
highly increased and the likelihood of staying anonymous is reduced. 

As we see, it is highly desirable to prevent the service provider from being 
able to link profiles. 

3 Analysis of the Simple Profile Model 

In this Section, we construct a simple model to represent profile information that 
can be analyzed easily. 

3.1 Participants of the Model 

1. The User. Alice has her preferences and interests collected into a profile. She 
wants to get services that match her profile, but also wants to protect her 
privacy and make it impossible for anyone else to combine the subprofiles 
revealed during different sessions. 

2. The Service Provider. The music store gets subprofiles from the user during 
transactions and offers services that match the preferences of the user. It also 
wants to deduce as much information from the disclosed profiles as possible, 
in particular it wants to link profiles belonging to the same user, because 
this is a key to learn more about the behavior of its users. 

3. The Communication Channel. From the perspective of the service provider, 
the channel between itself and the user is pseudonymous for one transaction 
and anonymous between different transactions: the provider does not know 
whether two transactions have been committed by the same person or not. 
The end of a transaction occurs when the user surfs to a different web page, 
finishes communication using her Bluetooth device, etc. If the channel were 
not anonymous (for example the user had a communication address that is 
fixed across transactions), then the provider would be always able to combine 
all information about the user and there would be no point in randomly 
disclosing different parts of the profile. 

In practice, the anonymity of the communication can be ensured by mix 
networks 0. 0 In special domains such as Bluetooth, there are efforts 
going on to make the protocol anonymous by its nature, which is currently 
prohibited by fixed unique device addresses |0|. 



3.2 Profile Representation 

We assume that there is a large set that contains all possible elements that can 
be part of a user profile. We call this set the Universe: 

$$U := \{e_1, e_2, \ldots, e_m\}; \quad |U| = m; \quad m < \infty. \tag{1}$$

In our music store example, U can be a list of musical styles, artists, albums, 
etc. The profile x of a single user is a subset of the Universe: 

$$x \subseteq U. \tag{2}$$

For any $e_i \in U$, $e_i \in x$ means that the element $e_i$ can be linked with the 
user, such as a favorite artist or album. (This is referred to as implicit voting in 
the collaborative filtering literature.) 

When the user engages in a session with a service provider and wants to show 
her profile, she reveals a non-deterministic subset of her profile: 

$$D(x) \subseteq x. \tag{3}$$

This means that Alice never discloses that her profile does not contain a 
particular entry. If $e \in D(x)$, then the provider can be sure that $e \in x$, but if 
$e \notin D(x)$, then it can say nothing. In the ideal case, $D(x)$ is sufficient to create 
a personalized service (offer appealing CDs), but not enough for the provider to 
identify the user as one of its previous clients. 

3.3 Further Assumptions 

In order to keep the model simple, we also assume that the elements of the profile 
are independent: 

1. For any elements $e_1, e_2$ such that $e_1 \neq e_2$, the events $e_1 \in x$ and $e_2 \in x$ are 
independent for any possible $e_i$ and profile $x$. 

2. Any element $e_i \in U$ is part of a profile with the probability 

$$p_i = \Pr\{e_i \in x\}. \tag{4}$$

3. For any element $e_i \in x$, the probability that it is revealed to the provider 
during one showing is 

$$q_i = \Pr\{e_i \in D(x) \mid e_i \in x\}. \tag{5}$$

Note that this assumption is a simplification of reality. Real world elements 
may depend on each other: if Alice likes Aerosmith, it will be more likely that 
she also likes AC/DC or some other similar band, compared to the case if her 
favorite artist is Michael Jackson. 

Another drawback of this model is that because of the independence between 
elements, it is not possible to measure whether a disclosed profile is a good 
representation of the real profile and the preferences of Alice, since disclosing 
an element does not say anything about other elements. This paper does not 
examine how well a disclosed profile represents the preferences of the user. 
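
To make the independence assumptions concrete, the following Python sketch samples a profile and a disclosed subprofile from the model. It is only an illustration: elements are represented by their indices, and the helper names `sample_profile` and `disclose` are hypothetical.

```python
import random
from typing import List, Set


def sample_profile(p: List[float], rng: random.Random) -> Set[int]:
    """Each element i joins the profile x independently with probability p[i]."""
    return {i for i, p_i in enumerate(p) if rng.random() < p_i}


def disclose(profile: Set[int], q: List[float], rng: random.Random) -> Set[int]:
    """Each element of the profile is revealed independently with probability q[i]."""
    return {i for i in profile if rng.random() < q[i]}


rng = random.Random(2002)          # fixed seed so that the run is repeatable
p = [0.05] * 200                   # assumed uniform membership probabilities
q = [0.5] * 200                    # assumed uniform disclosure probabilities

x = sample_profile(p, rng)
d1 = disclose(x, q, rng)
d2 = disclose(x, q, rng)
print(len(x), len(d1), len(d2), len(d1 & d2))
```

By construction a disclosed set is always a subset of the profile, which mirrors the property that the user never reveals the absence of an element.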

Nevertheless, the effects on privacy can be analyzed relatively simply, as it 
is discussed in the next section. The possibilities of a more complex and more 
realistic model will be analyzed in Sect. 4. 

3.4 Analysis of Linkability 

Now let us examine the probability that the music store is able to discover a 
link between two disclosed profiles, $d$ and $d'$ ($D$ is not written as a function here, 
because it is unknown which profile it came from): 

$$D = d;\; D' = d' \tag{6}$$

Let $X = X'$ denote the event that $d$ and $d'$ have been disclosed from the 
same profile. (Note that this is not the event that $x$ and $x'$ contain exactly the 
same elements.) 

Let $f(d)$ denote the probability distribution of a disclosed profile in the case 
when the profile and its disclosed profile $d$ are independent of all other previously 
revealed profiles (for example, when the very first client comes in): 

$$f(d) = \Pr\{D = d\}. \tag{7}$$

Let $g(d, d')$ denote the joint probability distribution of two disclosed profiles 
$d$ and $d'$ in the case when they have been revealed from the same user profile: 

$$g(d, d') = \Pr\{D = d;\; D' = d' \mid X = X'\}. \tag{8}$$

Let $f_i$ and $g_i$ denote the variants of the formulas $f$ and $g$ when they take 
only a single element $e_i$ into account rather than all $e \in U$ (the imaginary case when 
$U = \{e_i\}$). Now 

$$f_i(d_i) = \begin{cases} p_i q_i & \text{for } e_i \in d; \\ 1 - p_i q_i & \text{for } e_i \notin d; \end{cases} \tag{9}$$

and 

$$g_i(d_i, d'_i) = \begin{cases} p_i q_i^2 & \text{for } e_i \in d;\ e_i \in d'; \\ p_i q_i (1 - q_i) & \text{for } e_i \in d;\ e_i \notin d' \text{ or } e_i \notin d;\ e_i \in d'; \\ (1 - p_i) + p_i (1 - q_i)^2 & \text{for } e_i \notin d;\ e_i \notin d'. \end{cases} \tag{10}$$

Since elements of $U$ are independent from each other, we can calculate 
probabilities for whole disclosed profiles by multiplying the probability values of each 
element: 

$$f(d) = \prod_{i=1}^{m} f_i(d_i); \qquad g(d, d') = \prod_{i=1}^{m} g_i(d_i, d'_i). \tag{11}$$

Let r denote the probability that two disclosed profiles come from the same 
profile, in the case we do not know anything about the disclosed profiles: 



$$r := \Pr\{X = X'\}. \tag{12}$$



The variable r represents an expectation of the provider about the number 
of users using its service: if the provider expects n users, then r = 1/n. The 
knowledge of r (which is independent of all disclosed profiles) is required to 
calculate the probability of a match for two known disclosed profiles. 

Now we can calculate the probability of a match using the conditional 
probability theorem: 

$$\Pr\{X = X' \mid D = d;\; D' = d'\} = \frac{\Pr\{D = d;\; D' = d';\; X = X'\}}{\Pr\{D = d;\; D' = d'\}} \tag{13}$$

where 

$$\Pr\{D = d;\; D' = d';\; X = X'\} = \Pr\{D = d;\; D' = d' \mid X = X'\} \Pr\{X = X'\} = g(d, d')\, r \tag{14}$$

and by the theorem of total probability 

$$\begin{aligned} \Pr\{D = d;\; D' = d'\} &= \Pr\{D = d;\; D' = d' \mid X = X'\} \Pr\{X = X'\} \\ &\quad + \Pr\{D = d;\; D' = d' \mid X \neq X'\} \Pr\{X \neq X'\} \\ &= g(d, d')\, r + f(d) f(d') (1 - r). \end{aligned} \tag{15}$$

Hence, 

$$\Pr\{X = X' \mid D = d;\; D' = d'\} = \frac{g(d, d')\, r}{g(d, d')\, r + f(d) f(d') (1 - r)}. \tag{16}$$

This formula, using the above definitions of $f(d)$ and $g(d, d')$, gives us the 
probability that two concrete disclosed profiles have been revealed by the 
same person. 
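
A direct numerical evaluation of (16) can be sketched in Python as follows. Elements are represented by indices, disclosed profiles by sets of indices, and the function names are hypothetical. For a large Universe the plain products may underflow, so a practical implementation would probably work with logarithms instead.

```python
from typing import Sequence, Set


def f_i(p: float, q: float, disclosed: bool) -> float:
    """Single-element factor of f, following (9)."""
    return p * q if disclosed else 1.0 - p * q


def g_i(p: float, q: float, in_d: bool, in_d2: bool) -> float:
    """Single-element factor of g, following (10)."""
    if in_d and in_d2:
        return p * q * q
    if in_d or in_d2:
        return p * q * (1.0 - q)
    return (1.0 - p) + p * (1.0 - q) ** 2


def match_probability(d: Set[int], d2: Set[int], p: Sequence[float],
                      q: Sequence[float], r: float) -> float:
    """Posterior probability (16) that d and d2 stem from the same profile."""
    f_d = f_d2 = g = 1.0
    for i in range(len(p)):
        in_d, in_d2 = i in d, i in d2
        f_d *= f_i(p[i], q[i], in_d)
        f_d2 *= f_i(p[i], q[i], in_d2)
        g *= g_i(p[i], q[i], in_d, in_d2)
    return g * r / (g * r + f_d * f_d2 * (1.0 - r))


# Two showings that share three rare elements out of a Universe of 100.
p = [0.01] * 100
q = [0.5] * 100
print(match_probability({1, 2, 3}, {1, 2, 3, 4}, p, q, r=1 / 1000))
```

The prior `r` corresponds to the provider's expectation about the number of users; the example uses an assumed population of 1000.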



3.5 Protecting the Privacy of the User 

Let us assume that the store uses some limit $l$, and if 

$$\Pr\{X = X' \mid D = d;\; D' = d'\} > l, \tag{17}$$

then it considers $d$ and $d'$ as belonging to the same person. If we formalize the 
condition for a match, we get: 

$$\frac{g(d, d')\, r}{g(d, d')\, r + f(d) f(d') (1 - r)} > l. \tag{18}$$

We can transform this into: 

$$\mu := \frac{g(d, d')}{f(d) f(d')} > \frac{l (1 - r)}{r (1 - l)} =: \lambda \approx \frac{1}{r (1 - l)}. \tag{19}$$

From now on we call the quantity $\mu$ the match factor. In order to allow the 
service provider to link profiles, $\mu$ must be greater than a limit $\lambda$. 

Since $1 - r$ and $l$ are both close to 1, $\lambda \approx 1/(r(1 - l))$. This means in practice 
that to link two profiles from a population of 10000 with 99 percent confidence 
is roughly as hard as to link from 100000 with 90 percent confidence. 
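
The population example can be checked numerically. In this short Python sketch the function name `link_threshold` is hypothetical; it evaluates the limit on the match factor, l(1 - r) / (r(1 - l)), with r = 1/n for n expected users.

```python
def link_threshold(r: float, l: float) -> float:
    """Limit that the match factor must exceed for a link: l(1 - r) / (r(1 - l))."""
    return l * (1.0 - r) / (r * (1.0 - l))


# Population of 10000 at 99 percent confidence vs. 100000 at 90 percent confidence.
print(link_threshold(1 / 10_000, 0.99))
print(link_threshold(1 / 100_000, 0.90))
```

Both values are close to 10^6, which is what the approximation 1/(r(1 - l)) gives in each case.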

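Because the elements are independent, we can transform the left side of 
the previous equation into: 

$$\frac{g(d, d')}{f(d) f(d')} = \frac{\prod_{i=1}^{m} g_i(d_i, d'_i)}{\prod_{i=1}^{m} f_i(d_i) \prod_{i=1}^{m} f_i(d'_i)} = \prod_{i=1}^{m} \frac{g_i(d_i, d'_i)}{f_i(d_i) f_i(d'_i)}. \tag{20}$$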

We obtain the result that the match factor of two disclosed profiles is the 
product of the match factors for the individual elements of $U$. By analyzing how 
much information is revealed by individual elements, we get the values in Table 1. 

Table 1 shows that for practical cases, where $p_i$ is usually significantly lower 
than 1, the most information is revealed if there are elements that are part of 
both disclosed profiles. 

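Based on this, we can bound the value of $\mu$ from above: 

$$\mu \le \prod_{i=1}^{m} \left(\frac{1}{p_i}\right)^{\chi_i} \left[1 + \frac{p_i q_i^2 (1 - p_i)}{(1 - p_i q_i)^2}\right], \tag{21}$$

where $\chi_i$ is 1 if $e_i$ was disclosed both in $d$ and $d'$, and 0 otherwise. 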

This method can be used to estimate how much correlation is revealed by 
two disclosed profiles (with any non-uniform $p_i$). Linking can be avoided if the 
result of the above formula remains less than the limit $\lambda$. We can also see that 
more information is revealed if $p_i$ is small, i.e. Alice discloses that her profile 
contains a rare artist. 

This check can also be combined with the conservative strategy to estimate 
how much information is revealed by a single disclosed profile according to 
conservative measures (by comparing the profile to itself), introduced in Sect. 3.7. 

Table 1. How Elements Influence the Match Factor 

Case                                  $f_i(d_i) f_i(d'_i)$              $g_i(d_i, d'_i)$               $g_i(d_i, d'_i) / (f_i(d_i) f_i(d'_i))$
$e_i \in d$, $e_i \in d'$             $p_i^2 q_i^2$                     $p_i q_i^2$                    $1/p_i$
$e_i \in d$, $e_i \notin d'$ or       $p_i q_i - p_i^2 q_i^2$           $p_i q_i - p_i q_i^2$          $1 - q_i (1 - p_i) / (1 - p_i q_i)$
$e_i \notin d$, $e_i \in d'$
$e_i \notin d$, $e_i \notin d'$       $1 - 2 p_i q_i + p_i^2 q_i^2$     $1 - 2 p_i q_i + p_i q_i^2$    $1 + p_i q_i^2 (1 - p_i) / (1 - p_i q_i)^2$



3.6 A Special Case: Uniform Probabilities 

Assuming $\forall i: p_i = p,\ q_i = q$ is a heavy simplification and unrealistic in 
real-world cases. However, calculations in this case are simpler and the results can 
be interpreted more easily, so we consider this case as an important step towards 
understanding how more realistic models work. 

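If we assume uniformity, the previous formula takes the following form (where 
$\hat{m} = |d \cap d'|$ is the number of common elements disclosed in both profiles): 

$$\mu \le \left(\frac{1}{p}\right)^{\hat{m}} \left(1 + \frac{p q^2 (1 - p)}{(1 - pq)^2}\right)^{m} \tag{22}$$

Requiring the right-hand side to stay below $\lambda$, this can be transformed into a 
direct condition on $\hat{m}$ (since $p < 1$): 

$$\hat{m} < \frac{\ln \lambda - m \ln\left(1 + \frac{p q^2 (1 - p)}{(1 - pq)^2}\right)}{-\ln p} \tag{23}$$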


If the bound on $\hat{m}$ is negative, the parameters are such that the fact that 
both disclosed profiles are empty (which is represented in the right part of the 
numerator) is in itself enough to link two profiles, and this estimate cannot be 
used for creating a bound for $\hat{m}$. 

Also, $m$ is not independent from $p$, since $p$ represents the same ratio for all 
users that $m/|x|$ represents for this particular user (where $|x|$ is the number of 
elements actually present in the profile of the user). To simplify our model, we 
did not take this into account, and assumed that the two are independent. 
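
The uniform-case condition on the number of common elements can be evaluated with a few lines of Python. The function name `max_common_elements` and the parameter values are assumptions for illustration; the formula computed is (ln(limit) - m ln(1 + c)) / (-ln p) with c = p q^2 (1 - p) / (1 - pq)^2, where m is the size of the Universe.

```python
import math


def max_common_elements(m: int, p: float, q: float, limit: float) -> float:
    """Strict bound on the number of elements two showings may share (uniform case)."""
    c = p * q * q * (1.0 - p) / (1.0 - p * q) ** 2
    return (math.log(limit) - m * math.log1p(c)) / (-math.log(p))


# Assumed parameters: Universe of 1000 elements, p = 0.01, q = 0.5, limit 10^6.
print(max_common_elements(1000, 0.01, 0.5, 1e6))

# With a much larger Universe the bound turns negative, the degenerate case
# in which the estimate cannot be used.
print(max_common_elements(10_000, 0.01, 0.5, 1e6))
```

With the first parameter set the bound lies between 2 and 3, so two showings may share at most two elements before the estimate no longer rules out a link.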



3.7 Strategies for Disclosing the Profile 

Based on the above calculations, we can create strategies for how the client 
should reveal profile information, avoiding $\mu$ being greater than $\lambda$ for any two 
disclosed profiles. 

Although the model of the service provider assumes that clients disclose their 
profile on a non-deterministic basis, the actual algorithm of a single user may be 
deterministic. The user can choose any algorithm that makes it difficult to link 
the profiles. If many clients use the same method to optimize disclosed profiles, 
that makes it possible for the service provider to modify its strategy accordingly, 
but we ignore this effect. 

Conservative Approach. Regardless of how many times a profile is revealed, 
Alice can be always safe if she compares each disclosed profile to itself before 
disclosing, i.e. she checks what $\mu$ would be if the store tried to link two 
instances of the same disclosed profile. This results in a fairly low limit, but 
is perfectly safe. (In the uniform case, this means that no disclosed profile can 
contain more than $\hat{m}$ elements.) 
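
The conservative check can be sketched in Python by computing the logarithm of the match factor of a candidate showing against itself and comparing it to the logarithm of the limit. The greedy selection order (most common elements first) and all names below are assumptions for illustration, not a strategy taken from the paper; the sketch also assumes 0 < p[i] < 1 and 0 < q[i] < 1.

```python
import math
from typing import Iterable, Sequence, Set


def log_match_factor(d: Set[int], d2: Set[int],
                     p: Sequence[float], q: Sequence[float]) -> float:
    """Logarithm of the match factor as a sum of per-element contributions."""
    total = 0.0
    for i in range(len(p)):
        if i in d and i in d2:          # disclosed in both: factor 1/p
            total += -math.log(p[i])
        elif i in d or i in d2:         # disclosed in one: factor (1-q)/(1-pq)
            total += math.log(1.0 - q[i]) - math.log(1.0 - p[i] * q[i])
        else:                           # disclosed in neither
            c = p[i] * q[i] ** 2 * (1.0 - p[i]) / (1.0 - p[i] * q[i]) ** 2
            total += math.log1p(c)
    return total


def conservative_subset(candidates: Iterable[int], p: Sequence[float],
                        q: Sequence[float], limit: float) -> Set[int]:
    """Greedily add elements while the showing compared to itself stays below the limit."""
    chosen: Set[int] = set()
    for i in sorted(candidates, key=lambda k: p[k], reverse=True):
        trial = chosen | {i}
        if log_match_factor(trial, trial, p, q) < math.log(limit):
            chosen = trial
    return chosen


p = [0.01] * 1000
q = [0.5] * 1000
showing = conservative_subset(range(10), p, q, limit=1e6)
print(sorted(showing))
```

Trying common elements first is one plausible heuristic, since rare elements contribute a larger factor 1/p and therefore exhaust the budget faster.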

The Effect of Extra Elements. If the user wants to disclose more elements,
the probability of linking will not be zero any more. However, some methods can
be used to keep the amount of information lost to a minimum and to prevent
an avalanche effect that would give the provider a comprehensive profile of the
individual.

If, for any two disclosed profiles, μ is greater than λ only if the profiles are
identical, then the provider will be able only to deduce that the two profiles 
have been disclosed by the same person. It will not get a richer profile by linking 
those two and will not be able to link further disclosed profiles to these ones. 
This is only a minimal loss to anonymity that can be acceptable in some cases, 
especially when this happens relatively rarely. 

One Extra Element. When disclosing a profile with this method, μ can be
greater than λ, but taking out any single element should mean that μ is lowered
below λ. This ensures that only identical profiles can be linked by the store. (In
the uniform case, this means that disclosed profiles cannot contain more than
m̂ + 1 elements.)
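
The two rules above can be sketched for the uniform case as follows. This is only an illustration under stated assumptions: we assume that the match factor of a disclosed profile compared with itself is a product of one factor per disclosed element and one factor per undisclosed element, and the function names are hypothetical. The numeric values in the usage note are those of the scenario in Sect. 5.2.

```python
import math

def log_mu_self(k, m, mu_both, mu_neither):
    """Log of the match factor when a k-element disclosed profile is compared
    with itself. Assumes a product of per-element factors: mu_both for each
    disclosed element, mu_neither for each of the m - k other elements."""
    return k * math.log(mu_both) + (m - k) * math.log(mu_neither)

def is_conservative(k, m, mu_both, mu_neither, lam):
    """Conservative approach: even two identical showings stay at or below lambda."""
    return log_mu_self(k, m, mu_both, mu_neither) <= math.log(lam)

def is_one_extra(k, m, mu_both, mu_neither, lam):
    """One extra element: identical showings exceed lambda, but removing any
    single element (all elements are alike in the uniform case) does not."""
    return (k >= 1
            and not is_conservative(k, m, mu_both, mu_neither, lam)
            and is_conservative(k - 1, m, mu_both, mu_neither, lam))
```

With m = 200, a factor of 20 for disclosed elements, 1.01446 for undisclosed ones and a limit of 10^6, `is_conservative` accepts profiles of up to 3 elements and `is_one_extra` accepts exactly 4.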

To minimize the loss, the user can choose to reveal profile subsets in a
deterministic order so that the delay between identical showings is maximal (but
care should be taken that the provider is unable to take advantage of the fact
that profiles are disclosed according to a pattern).

With this strategy, user anonymity is shifted a little towards pseudonymity,
because the provider will be able to link some disclosed profiles, as if they
belonged to many different users. How large the shift is depends on the number of
possible combinations: 1 means total pseudonymity (when the whole profile is
disclosed every time), while ∞ means total anonymity (if a different combination
can be revealed every time).

More Extra Elements. In theory, even more extra elements can be revealed
if the disclosed sets are chosen carefully so that μ > λ holds only in the case
of identical profiles, at the price of reducing the number of possible combinations
that can be disclosed, hence resulting in a bigger shift towards pseudonymity.

4 Extended Model

Although the previous model can be analyzed easily, it is not very realistic. An 
important step towards reality is if we do not assume independence between the 
elements of the model, and let them be correlated. 

4.1 How Correlation Changes the Model 

At this point, we do not assume any particular type of correlation. Instead, let
us introduce two functions that replace p_i and q_i in the equations of the simple
model. These functions operate on the whole profile rather than on individual
elements:

$$P(x) = \Pr\{X = x\}, \quad \text{and} \tag{24}$$

$$Q(x, d) = \Pr\{D = d \mid X = x\}. \tag{25}$$

With these functions, we can also replace the functions f and g of the simple
model:

$$f(d) = \sum_{\forall x} P(x)\, Q(x, d), \quad \text{and} \tag{26}$$

$$g(d, d') = \sum_{\forall x} P(x)\, Q(x, d)\, Q(x, d'). \tag{27}$$

These expressions are expensive to compute in their raw form, since
they have to be evaluated for all possible profiles (in fact only for those profiles
x where d ⊆ x). However, they replace the relevant parts of the simple model.
The other equations of the simple model remain true, and they can be used
together with the above functions to examine this general case.
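
As an illustration of (26) and (27), the following sketch evaluates f and g by brute-force enumeration of all profiles, which is feasible only for very small m. The prior table `P_TABLE` is a made-up correlated distribution over two elements, and `independent_Q` is our assumed form of the independent disclosure model; neither is taken from measured data.

```python
from itertools import product

def f(d, P, Q):
    """Equation (26): sum of P(x) Q(x, d) over all profiles x."""
    return sum(P(x) * Q(x, d) for x in product((0, 1), repeat=len(d)))

def g(d, d2, P, Q):
    """Equation (27): sum of P(x) Q(x, d) Q(x, d') over all profiles x."""
    return sum(P(x) * Q(x, d) * Q(x, d2)
               for x in product((0, 1), repeat=len(d)))

def independent_Q(q):
    """Build Q(x, d) for independent disclosure with probability q per element."""
    def Q(x, d):
        prob = 1.0
        for xi, di in zip(x, d):
            if di and not xi:
                return 0.0          # an element not in x cannot be disclosed
            if xi:
                prob *= q if di else 1.0 - q
        return prob
    return Q

# Hypothetical correlated prior over two elements: they tend to occur together.
P_TABLE = {(0, 0): 0.90, (0, 1): 0.02, (1, 0): 0.02, (1, 1): 0.06}

def P(x):
    return P_TABLE[x]

Q = independent_Q(0.3)
both = (1, 1)
f_both = f(both, P, Q)              # only x = (1, 1) can disclose both elements
g_both = g(both, both, P, Q)
```

Profiles and disclosed profiles are represented as 0/1 tuples of equal length; any other P or Q with the same call signature can be substituted.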

In our case, probability distributions are not formally given; they can only be
estimated from real-world experience. Thus, the correlation between profile
elements cannot be calculated formally from the distributions, but has to be
obtained by sampling user habits: it is not something to be calculated, but
rather something to be measured.

The P function is therefore related to recommender systems and collaborative
filtering methods. In collaborative filtering, we have some information
about preferences of previous users, and use this information to make recom- 
mendations for new users, based on partial information about them. However, 
the same knowledge can be also used to estimate the value of the P function, 
i.e. how likely a given profile is. With more or less modification, we can adapt 
these algorithms for our purposes. 

There are many different types of collaborative filtering methods, including
correlation between users, vector similarity methods, Bayesian clustering
and Bayesian network methods.

The Q function represents how elements are disclosed from the profile. Although
it would be possible to create a more complex model, the independent
model is reasonable in this case also.

4.2 Converting the Bayesian Clustering Method

Although collaborative filtering methods have similarities with the current case,
it is in general not straightforward to set up a method for calculating P using a
given algorithm. Let us do this for the Bayesian cluster model.

In its original form, the profile elements are conditionally independent given
membership in an unobserved class variable C, which can take a small number of
discrete values:

$$\Pr\{C = c, X_1 = x_1, \ldots, X_m = x_m\} = \Pr\{C = c\} \prod_{i=1}^{m} \Pr\{X_i = x_i \mid C = c\} \tag{28}$$

Existing algorithms, such as the EM algorithm, can be used to set up the model
parameters by learning from previous samples. Once the parameters are
learned, P can be calculated in the following way:

$$P(x) = \sum_{\forall c} \Pr\{C = c, X_1 = x_1, \ldots, X_m = x_m\} \tag{29}$$

With some optimizations, this method can be used to evaluate (26) and (27),
and hence calculate μ.
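
A minimal sketch of (28) and (29) follows, assuming the class priors and the per-class element probabilities have already been learned. The parameter values and the names `cluster_P`, `CLASS_PRIOR` and `COND` are invented for the example.

```python
from itertools import product

def cluster_P(x, class_prior, cond):
    """Equation (29): P(x) = sum over c of Pr{C=c} * prod_i Pr{X_i = x_i | C=c}.

    class_prior[c] is Pr{C = c}; cond[c][i] is Pr{X_i = 1 | C = c}.
    """
    total = 0.0
    for prior, element_probs in zip(class_prior, cond):
        joint = prior                       # equation (28), built up per element
        for xi, pi in zip(x, element_probs):
            joint *= pi if xi else 1.0 - pi
        total += joint
    return total

# Hypothetical learned parameters: two classes, three profile elements.
CLASS_PRIOR = [0.6, 0.4]
COND = [[0.8, 0.7, 0.1],
        [0.1, 0.2, 0.9]]

p_example = cluster_P((1, 1, 0), CLASS_PRIOR, COND)
```

Because elements are independent only within a class, the resulting P(x) is correlated across elements, which is what the extended model requires.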

4.3 Alternative Method: Measuring Correlation 

To model correlation, we can also define a factor that modifies the probabilities 
of unknown elements based on the elements we know: 

$$\Pr\{e_i \in x \mid D = d\} = c(e_i, d)\, p_i, \tag{30}$$

meaning that the probability of e_i's presence in a profile gets multiplied by
c(e_i, d) once we get to know that the elements contained in d are already there.
The value of c may be smaller or greater than 1.

We add correlation to the simple model by replacing p_i with c(e_i, d) p_i in all
calculations of the simple model in Sect. 3. If we have samples from previous
sessions, c(e_i, d) can be measured by examining the frequency of e_i in the samples
where the elements of d are also present.

Note that this kind of correlation is not something to be formally calculated;
it requires samples from previous sessions or another estimate of the values of c.
Nevertheless, this approach could also help to extend the model.
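
The measurement described above can be sketched as follows. The function name and the sample data are hypothetical; the estimate simply divides the frequency of an element among the samples containing d by its overall frequency, following (30).

```python
def correlation_factor(samples, i, d):
    """Estimate c(e_i, d) from sample profiles.

    samples: list of sets of element identifiers; d: set of known elements.
    Returns None when the samples give no basis for an estimate.
    """
    if not samples:
        return None
    p_i = sum(1 for s in samples if i in s) / len(samples)
    with_d = [s for s in samples if d <= s]      # samples containing all of d
    if p_i == 0 or not with_d:
        return None
    p_i_given_d = sum(1 for s in with_d if i in s) / len(with_d)
    return p_i_given_d / p_i

# Hypothetical sample profiles from previous sessions.
SAMPLES = [{1, 2}, {1, 2}, {1}, {3}, {2, 3}, {4}]
c_2_given_1 = correlation_factor(SAMPLES, 2, {1})   # element 2 is more likely with 1
c_3_given_1 = correlation_factor(SAMPLES, 3, {1})   # element 3 never occurs with 1
```

A value above 1 indicates that knowing d makes the element more likely; a value below 1 makes it less likely.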

5 Practical Problems 

5.1 Determining the Parameters of the Model 

In the previous sections we assumed that the service provider and the user have
some estimate on how to compute P(x) and Q(x, d). In a real life scenario, both
the user and the shop must provide some estimates for these probabilities.

Service Provider. The provider has a relatively comfortable situation: it probably
has access to sales or other statistics, hence it has a good basis to estimate
P(x). It has a number of previously disclosed profiles, which it can use to estimate
Q(x, d). If we consider the simple model, the shop can compute q_i = (p_i q_i)/p_i.
For more complex models, this is certainly more difficult.

User. The knowledge of P(x) is more important to the user, but Alice also
needs to have an estimate of the global Q(x, d). For convenience, she can assume
that the global probability of disclosing a combination matches her own personal
strategy.

The user’s position is more difficult than the provider’s: by default she knows 
only her own profile, and nothing about other users. It is also unlikely that the 
user is willing to run data mining software in her spare time to collect this 
information. The possible solutions include: 

Worst-Case Estimate. The user can make a worst case estimate that works 
well even under difficult conditions, such as when p is small. 



Rating System. There can be some trusted central agency that collects ratings,
and users can fetch this data. Rating can be done by sampling a small number of
representative users, as in the Nielsen rating system that was originally used for
measuring the popularity of television broadcasts. Rating could be carried out
off-line for all profile entries, and repeated over time to take into account newer
profile elements and preference changes.

Anonymous Voting. Another approach is that each user should submit complete
or partial profiles regularly in an anonymous manner. The profile entries
could be encrypted and posted to a mix network that mixes individual entries,
resulting in a large number of entries as output from the network. This
information could be used to estimate the popularity of entries.

This method is more accurate than the previous ones, since it works on a large
dataset representing preferences from many users; however, correlation information
is lost. One solution for this is that the users post entry pairs or bigger
tuples, but care should be taken that this information is insufficient to
compromise user privacy.



5.2 A Concrete Scenario 



This section introduces an imaginary scenario with the Music Store and shows
the results of analysis with the simplest model, also assuming uniform probabilities.
We assume that the profile of the user contains her favorite artists, with
the parameters listed in Table 2.

The condition on m̂ is

$$\hat{m} \le \frac{\ln 10^{6} - 200 \ln 1.01446}{\ln 20} \approx 3.65280. \tag{31}$$

Table 2. Parameters for the Imaginary Scenario

Number of known artists:                 m = 200
Pr{An artist is in a profile}:           p = 0.05
Pr{An entry is disclosed}:               q = 0.3
Expected population:                     10000
Probability limit for matching:          0.99
Average artists per user:                pm = 10
Average artists per disclosed profile:   pqm = 3
Matching limit for μ:                    λ ≈ 10^6


Table 3. The Value of μ_i for Different Cases

Case                                          μ_i
e_i ∈ d, e_i ∈ d'                             20
e_i ∈ d, e_i ∉ d'  or  e_i ∉ d, e_i ∈ d'      ≈ 0.710659
e_i ∉ d, e_i ∉ d'                             ≈ 1.01446


For the service provider this means that it can link two disclosed profiles 
if they have at least 4 common artists, and for the user this means that it is 
perfectly safe to disclose 3 artists from the profile when using the conservative 
approach. 
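
The arithmetic behind these two numbers can be checked with a short sketch. It assumes that the bound in (31) has the form (ln λ − m ln μ_neither) / ln μ_both, with m = 200 known artists and the factors from Table 3; the function name is ours. Since the table value 1.01446 is rounded, the computed bound agrees with (31) only to about two decimal places.

```python
import math

def max_common_elements(lam, m, mu_both, mu_neither):
    """Bound on the profile size: (ln lambda - m ln mu_neither) / ln mu_both."""
    return (math.log(lam) - m * math.log(mu_neither)) / math.log(mu_both)

bound = max_common_elements(lam=1e6, m=200, mu_both=20, mu_neither=1.01446)
safe_size = math.floor(bound)   # largest profile that is safe to disclose
link_size = safe_size + 1       # smallest number of common artists that links
```

Here `safe_size` is 3 and `link_size` is 4, matching the statement above.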

The probability calculated by the service provider for a match if two disclosed
profiles contain exactly the same 3 elements is:

$$\Pr\{X = X' \mid d = d';\ |d| = |d'| = 3\} \approx 0.655447 \approx 65.5\%, \tag{32}$$

which (hopefully) does not represent enough certainty to link them.

Note that the limit we got for m̂ in this case is equal to the value of pqm, but
this would not necessarily be true if we were to change the initial parameters. This
equality means that the user can be sure that the algorithm works if all other
users use the same strategy.

6 Future Work 

Further work is required to identify which correlation algorithm is the most
realistic in this environment, to examine how collaborative filtering methods could
be extended for this purpose, and possibly to find new algorithms.

It is also important to test the methods on real datasets. From the perspective
of the service provider, it would be interesting to see how efficiently matching
works, and how many real and fake matches could be detected. From the user's
perspective, it is important to test the recommended algorithms and examine how
they behave if the user does not have access to the whole dataset, only some
extract of it. This, combined with collaborative filtering methods, could also be
used to measure how the disclosed profile can be used to make recommendations
for the user.

Another area to work on is how patterns can be defined for disclosing the profile
if the user wishes to disclose more elements (see Sect. 3.7).

Finally, another exciting possibility would be to introduce the disclosure of
“fake” elements, i.e. allowing Pr{e_i ∈ D(x) | e_i ∉ x} > 0.

7 Conclusion 

We presented a model for representing profile information and a method for esti- 
mating how profiles can be disclosed to avoid the linking of two transactions. The 
most important part of the result is the match factor μ based on a priori
knowledge of parameters. This can be used to provide estimates about the amount of
information revealed by disclosed profiles. We also created a theoretical frame- 
work for integrating with collaborative filtering methods to represent correlation 
and allow more realistic estimates. 

Privacy Enhancing Service Architectures

Abstract. Following from privacy requirements, new architecture solutions
for personalized services are expected to emerge. This paper proposes
a conceptual framework for designing privacy enabling service architectures,
with a special emphasis on the mobile domain. The treatment
of the subject is based on the principle of separating identity and
profile information. Basic building blocks such as Identity Broker, Pro- 
file Broker, Contract Broker and Authenticator are identified and then 
put together in different ways resulting in variations on the architecture 
theme. 



1 Introduction 

Personalizing services requires personal information. But it should not require 
all the information about a user. Special care has to be taken when personally 
identifiable information (PII) is disclosed for a service. In general, if no more 
information is disclosed about the user than what is absolutely necessary, max- 
imum privacy can be maintained. We call this intuitive approach the principle 
of minimum disclosure. 

Another intuitive principle is separation of identity and profile information. 
Because of the sensitivity of PII, it should be handled separately from the rest of 
user information which we call the profile. Without disclosing her real identity, 
a user can disclose a relatively rich set of personal information so that service
personalization becomes possible. For example, in order to get good restaurant 
suggestions, it is enough to give an anonymous profile of eating habits. 

When thinking about architecture solutions for future personalized services, 
the above principles should always be kept in mind. As this paper also illustrates, 
we expect new components with new responsibilities and hence new architecture 
schemes to emerge in the near future. 

Section 2 summarizes the required fundamentals. Section 3 introduces the
most important building blocks of privacy enabling architectures. In Section 4
the basic building blocks are put together in different ways resulting in different
architecture solutions.



R. Dingledine and P. Syverson (Eds.): PET 2002, LNCS 2482, 2003.

2 Fundamentals

2.1 Basic Notions 

Identity and Linking. Identity is a central notion of privacy. The primary
meaning of the word is abstract: “the condition of being a specified person”
(Oxford dictionary) or “the condition of being oneself and not another”
(Macquarie dictionary)¹. In information technology, the word is also used to refer to
a set of information about an entity that differentiates it from all other, similar
entities. The set of information may be as small as a single code, specifically
designed as an identifier, or may be a compound of such data as given and family
name, date of birth and postcode of residence.

True identity (or real identity) refers to a set of information leading to the
real person (cf. verinymity below).

Linking means establishing a connection between data. In the context of privacy,
it becomes important when (non-identified) information about a person
gets linked to that person’s true identity, e.g. buying habits to name and address.



Nymity. Nymity² is the extent to which identity information disclosed during
a session or transaction is linked to a true identity of an entity. There are three
nymity levels: verinymity, pseudonymity and anonymity³.

In case of verinymity, a verinym is used. “By verinym or True Name we [. . . ] 
mean any piece of identifying information that can single you out of a crowd of 
potential candidates. For example, a credit card number is a verinym”.

In case of pseudonymity⁴, a pseudonym is used. A pseudonym is a persistent
fictitious name. It is unique enough so that the communicating party can
be distinguished from other parties but does not contain enough information 
for getting to the real person. A person can have several pseudonyms, estab- 
lishing a different virtual person for different services. Persistency means that 
one pseudonym is typically used multiple times. On this basis, one party can 
remember the other party (and e.g. personalize its service). Login names at free 
Internet storage providers are pseudonyms. 

¹ Note that these definitions of identity are not as awkward as they might seem (Roger
Clarke, whose ideas on identity are borrowed here, puts them in contrast to
the practically accepted concept of identity, which is rather the concept of identifier).
But what actually happens during authentication (of identity)? It is checked
whether a party is really the one it claims to be, i.e. whether it is identical
with the party whose identifier has been presented.

² The word nymity most probably originates from Lance Detweiler [2] and has been
widely used in the privacy context, as has the word nym.

³ The definition of verinymity is taken from Ian Goldberg’s Ph.D. thesis; the distinct
notions of persistent pseudonymity and linkable anonymity are not adopted here,
however.

⁴ For a more rigorous definition of anonymity and pseudonymity, see the paper by
Pfitzmann and Köhntopp [3].

Finally, in case of anonymity, no persistent name is used. An anonymous
communicating party cannot be remembered. It is also known as unlinkable 
anonymity. 

Note that the concepts of nymity and identity are strongly related but should 
not be mixed. The concept of real identity and verinym are so close to each other 
that they can be treated as synonyms. But more important are the differences 
between identity and nymity. Regardless of whether the session (or transaction) 
is verinymous, pseudonymous or anonymous, the communicating parties do have 
identities; I know that I have sent the message to X and not to Y, even if I don’t 
know anything more about X. Nymity is rather an attribute of the identity: it 
can be anonymous (the real person behind X tomorrow can very well be different 
from the one today), pseudonymous (the person behind X tomorrow is probably 
the same person as today) or verinymous (the real person behind X can be found
easily).



Personalization and Profile. In order for a service to be personalized, it needs 
information about the user. We call this information the user’s profile. Based on 
the CPExchange Specification [5], profiles typically contain the following types
of information: 

— demographics (gender, birth date, number of dependents, income level, etc). 

— activities (hobbies, occupations, etc), 

— interaction history (traces of sessions such as web site session, etc), 

— preference information (either explicitly provided, or inferred from past be- 
havior) . 

The following types of information are also treated as part of the customer
profile by CPExchange; they should, however, be treated separately as information
closely related to identity (each of them can be either verinymous or
pseudonymous):

— identification information (identifying numbers, names, etc), 

— contact Information (postal address, telephone number, e-mail address, web 
site, etc), 

— payment information (accounts with financial institutions and credit cards). 

Note that the borderline between identity and profile is not sharp but actually 
blurred. Two or more different pieces of the same profile, although not containing 
any identity information, together could lead to the real person. Imagine for 
example, that a user’s postal zip code is disclosed in one session and the birth 
date in another session. Neither of the two data elements, in itself, reveals
much. But if one knows that they belong to the same person, then the user’s
identity can be found out with a demography database at hand. This is
called triangulation.
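
Triangulation can be illustrated with a toy lookup. The database contents and the function name below are invented for the example; the point is only that each attribute alone leaves several candidates, while the combination leaves one.

```python
# Hypothetical demography database: (name, zip code, birth date).
DEMOGRAPHY = [
    ("Alice", "02474", "1970-03-14"),
    ("Bob",   "02474", "1982-11-02"),
    ("Carol", "20375", "1970-03-14"),
]

def candidates(zip_code=None, birth_date=None):
    """Return the names consistent with the attributes known so far."""
    return [name for name, z, b in DEMOGRAPHY
            if (zip_code is None or z == zip_code)
            and (birth_date is None or b == birth_date)]

by_zip = candidates(zip_code="02474")               # two people share the zip code
by_birth = candidates(birth_date="1970-03-14")      # two people share the birth date
linked = candidates("02474", "1970-03-14")          # once linked, one person remains
```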

Fig. 1. Nymity relations. Case 1: e.g. on-line banking; Case 2: low-risk services, e.g. inet storage, forum, MusicStore, restaurants; Case 3: pseudo/anonymous trading, auctions; Case 4: normal (Web) browsing; Case 5: c2c chat.



2.2 Nymity Relations 

Communicating parties can be in different relations with respect to nymity,
as Fig. 1 shows⁵ (v = verinymity, p = pseudonymity, a = anonymity, p(v)
= pseudonymity with a hidden but trusted verinymity, also called certified
pseudonymity). The letter on one end of the association between the client and
the service denotes under what nymity the communicating party on this end is
seen by the party on the other end (v, p, ...). For example, it is hard to imagine
that the relation between a bank and a user is other than verinym-to-verinym,
as Case 1) illustrates. The two most interesting cases are Case 2) and 3).

In Case 2), the service sees the user under a pseudonym (usually a name which
uniquely identifies the user within the service, but has no meaning outside of
it). For highly personalized services with no money involved, this level of
nymity is sufficient (and also desirable from a privacy point of view; cf. the
principle of minimum disclosure).

In Case 3), the service again sees the user under a pseudonym. The difference 
is that there is a hidden verinym also; somebody guarantees the service that it 
is possible to get to the verinym if needed, e.g. if the user does not want to pay 
or violates a contract (but not in “normal” cases). That is, there is a nymity
and a hidden (certified) nymity. This combination allows for pseudonymous,
or even anonymous, personal services involving money transfer, e.g. purchases
or auctions. Note that a(v) could also be used instead of p(v).

2.3 Profile Disclosure 

Disclosing a user’s profile, or part of it, is an extremely critical operation. Special
care must be taken over what part of the profile is given out, to whom, for what
purpose, for how long, with what usage rights, etc. Techniques like the
authentication of the requestor, determination of the purpose, and signing of legally
binding (digital) contracts support profile disclosure.

⁵ All figures in this paper use the notations of the UML.

Note that a complicated enough profile, even if not containing direct identity
data, is a pseudonym. That is, if (the disclosed part of) a profile is individual
enough, then it may serve as a basis for linking. Because of this, special care has
to be taken over what part of the profile is disclosed in a given situation.



2.4 Ideal and Practical Definitions of Privacy 

Based on the notions of true identity and linking, we are able to give an ideal 
definition of privacy: 

A system is privacy-enabled if no party is able to link data about a user to 
its true identity or its other data, unless the user has explicitly allowed it. 

The definition involves the following characteristics of such a system: 

— Enforcement by pure technology (as opposed to policy). 

— Opt-in always. “Gravity” attracts towards total anonymity: if the users do 
not do anything special, they remain anonymous. This is in line with the 
natural attitude that people prefer not to reveal their identity unless there is 
a purpose of it (concerning explicitly given information as well as information 
collected automatically in a more secretive manner). 

— Users are in full control over the linkability of their data. 

The above definition refers to an ideal point that we might never reach. First 
of all, it might prove technically infeasible to provide enforcement of privacy in 
such a system, meaning that the possibility for abuse remains. Nevertheless, it 
can serve as a “reference idea” to which we can always compare our real solutions 
and analyze what we have sacrificed. 

A more practical definition can be given if pure privacy technology can be 
accompanied by privacy policy: 

A system is privacy-enabled if no party is able to or has the right to link data
about a user to its true identity or its other data, unless the user has explicitly 
allowed it. 

It is desirable to incorporate privacy technology wherever possible, and poli- 
cies/rules are to be used for eliminating all remaining privacy threats that tech- 
nology cannot address. 

3 Building Blocks 

This section is devoted to identifying privacy-related functionalities and assign- 
ing them to conceptual entities. 

Fig. 2. Identity Broker



3.1 Identity Broker 

In the center of Fig. 2 there is the conceptual entity Identity Broker. Its
responsibility is translating identity, i.e. hiding the user’s real identity (verinym) from
the Services and exposing a pseudonym instead, or keeping the user in full anonymity
towards the Services. Identity Broker is a trusted party for the user.

There is an additional association in the figure, the one directly connecting
the Client with a Service. It illustrates that an Identity Broker is not always
needed for the sake of pseudonymous or anonymous communication. There are cases
when it is needed by the user, e.g. for maintaining the different pseudonyms for the
different Services. Note that Identity Broker might remain just a conceptual
entity in this case, since it can be a simple mapping between pseudonyms and
service addresses in the user’s terminal; but it can also be provided by a trusted
party in case of a dumb terminal. There are cases when the Identity Broker
is needed by the Service, e.g. when the Service needs to trust the user (when
making a contract, say); in this case, the Identity Broker can assure the Service
that there is a real identity behind the pseudonym, although not revealed
(cases p(v) and a(v) in Sect. 2.2). And there are other cases when the Identity
Broker is not needed, e.g. in case of pure data exchange without a need for trust.
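
The "simple mapping between pseudonyms and service addresses" mentioned above could look like the following sketch. The class and method names are hypothetical, and random identifiers are just one possible way of choosing pseudonyms.

```python
import secrets

class IdentityBroker:
    """Keeps one persistent pseudonym per Service; the verinym stays inside."""

    def __init__(self, verinym):
        self._verinym = verinym      # never handed out to a Service
        self._pseudonyms = {}        # service address -> pseudonym

    def pseudonym_for(self, service):
        """Return the pseudonym used towards this Service, creating it on first use."""
        if service not in self._pseudonyms:
            self._pseudonyms[service] = "nym-" + secrets.token_hex(8)
        return self._pseudonyms[service]

    def anonymous_id(self):
        """Return a one-time identifier that is neither stored nor reused."""
        return "anon-" + secrets.token_hex(8)

broker = IdentityBroker("Alice Example")
store_nym = broker.pseudonym_for("musicstore.example")
forum_nym = broker.pseudonym_for("forum.example")
```

Using a different pseudonym for each Service means that two Services cannot link the same user simply by comparing names.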

Note that identity translation is related to all network layers. If identity
translation takes place on the application layer, then the same or a
stronger translation should be done on the lower layers; otherwise the lower-level
identifier may inadvertently reveal the identity of the user. For example, this might
mean IP address anonymization on the connectivity layer and MAC address
tumbling on the access layer. For such a technique, see the classical paper of
David L. Chaum.



3.2 Profile Broker 

The responsibility of Profile Broker is to give access to user profiles in a controlled
way (create, request, edit and delete). Yet again, Profile Broker is a conceptual
entity which can serve a number of purposes. One purpose can be pure storage
(either on the terminal or as a network service). Another purpose can be
aggregation: different parts of the profile are stored at different places, with the broker
being responsible for providing simple access. Yet another purpose can be caching (to
give access to roaming profiles, say). All in all, Profile Broker is the access point
to profiles.



Fig. 3. Profile Broker

3.3 Disclosure Contract Broker 

Although strongly related to Profile Broker, a separate (Disclosure) Contract 
Broker is worth considering. The two brokers may be located at the same place 
or at separate places. The Contract Broker is responsible for: 

— letting the user choose the privacy contract describing the terms related to 
the disclosure of his/her profile data, 

— negotiating the contract terms with the receiving party (typically the service 
provider), 

— having the contract signed by the receiving party (before profile disclosure), 

— storing the signed contract for future reference and dispute resolution. 

If the Contract Broker is separate from the Profile Broker, the
Service first negotiates a suitable contract with the Contract Broker, as Fig. 5
illustrates⁶. Then it contacts the Profile Broker for the required profile, with the
valid contract in hand.
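
The two-step flow can be sketched as below. This is an illustration only: a shared-key HMAC tag stands in for a real digital signature, the two brokers are assumed to share that key, and all class, method and field names are hypothetical.

```python
import hashlib
import hmac
import json

def _tag(key, contract):
    """Authentication tag over a canonical encoding of the contract."""
    payload = json.dumps(contract, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

class ContractBroker:
    def __init__(self, key, user_terms):
        self._key = key
        self._terms = user_terms     # purpose -> fields the user allows
        self.archive = []            # contracts kept for dispute resolution

    def negotiate(self, service, purpose, fields):
        """Return (contract, tag) if the request fits the user's terms, else None."""
        if not set(fields) <= self._terms.get(purpose, set()):
            return None
        contract = {"service": service, "purpose": purpose,
                    "fields": sorted(fields)}
        tag = _tag(self._key, contract)
        self.archive.append((contract, tag))
        return contract, tag

class ProfileBroker:
    def __init__(self, key, profile):
        self._key = key
        self._profile = profile

    def disclose(self, contract, tag):
        """Release only the contracted fields, and only for a valid contract."""
        if not hmac.compare_digest(tag, _tag(self._key, contract)):
            raise PermissionError("invalid contract")
        return {name: self._profile[name] for name in contract["fields"]}
```

A Service would call `negotiate` first and then present the returned contract and tag to `disclose`; a contract altered after negotiation no longer verifies.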



Fig. 4. Contract Broker



3.4 Authenticator 

The responsibility of the Authenticator is to assure a party that another party is
really the one it claims to be. The result of the authentication can be
communicated by means of authentication tickets. One promising technology
for this is SAML, which is currently under development; in fact, SAML-coded
assertions could be used in many places (e.g. as access tickets, as opt-in
confirmations, as disclosure contracts).

⁶ Since UML does not support data flow modeling (at least not in a trivial way), we
use the following convention: instead of showing message names on the collaboration
diagrams, the name of the data (or activity if that suits better) is put into a UML
note.

Fig. 5. Profile disclosure supported by contract



Fig. 6. Authenticator



3.5 Privacy Control Panel 

Users should have access to a Privacy Control Panel through which they are able 
to control their privacy preferences. This entity, although it is an essential part 
of a real solution, is not shown in the rest of this article. 



3.6 Related Material 

Markus Jakobsson and Moti Yung have also worked out similar conceptual en- 
tities for the World Wide Web. They presented an assurance framework which 
covers the commercial aspects as well. The interested reader can find the details
in their paper.

4 Architecture Variations 

By deploying the conceptual entities listed in the previous section to different 
subsystems (parties), we actually get to different possible architectures. 



4.1 Pure Policy 

For the sake of completeness, Fig. 7 illustrates the solution most widely used
nowadays. The user’s profile (Customer Relationship Management data) is maintained
by the service. How the profile can be used by the service is
governed by policies, set up by the service, which the user may or may not
accept and the service may or may not obey.

Fig. 7. Pure Policy (entities shown: Client, Service, Authenticator, ProfileBroker)

4.2 Central Privacy Service Provider 

Putting all the privacy-related responsibilities into the hands of one party called
Privacy Service Provider (PSP), we get to a model that could be implemented
even now, as Fig. 8 illustrates. Since mobile operators are trusted parties for
both users and Services, they are a natural choice for running a PSP.

The user’s profile and privacy preferences are maintained by PSP (0). When 
connecting to a Service, the user logs on not to the Service directly but to PSP 
(1). Then PSP authenticates the user for the Service by means of an authenti- 
cation ticket (2). Single sign-on can naturally be implemented this way. Then 
if the user’s profile is required by the Service, PSP discloses it according to the 
user’s preferences (3). 
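
Steps (0) to (3) can be sketched as follows. The sketch assumes that the user has already been authenticated towards the PSP by some means not shown, and it uses opaque random tickets kept in a table at the PSP; the class, its methods and the example data are hypothetical.

```python
import secrets

class PrivacyServiceProvider:
    def __init__(self):
        self._profiles = {}   # user -> profile                      (step 0)
        self._prefs = {}      # user -> {service: allowed fields}    (step 0)
        self._tickets = {}    # ticket -> (user, service)

    def register(self, user, profile, prefs):
        """Step (0): the PSP maintains the profile and the privacy preferences."""
        self._profiles[user] = dict(profile)
        self._prefs[user] = {s: set(fields) for s, fields in prefs.items()}

    def log_on(self, user, service):
        """Steps (1)-(2): issue a ticket for one Service; it does not name the user."""
        if user not in self._profiles:
            raise KeyError("unknown user")
        ticket = secrets.token_urlsafe(16)
        self._tickets[ticket] = (user, service)
        return ticket

    def disclose(self, ticket, service):
        """Step (3): release only the fields the user allowed for this Service."""
        user, ticket_service = self._tickets[ticket]
        if ticket_service != service:
            raise PermissionError("ticket was issued for another service")
        allowed = self._prefs[user].get(service, set())
        return {k: v for k, v in self._profiles[user].items() if k in allowed}

psp = PrivacyServiceProvider()
psp.register("alice", {"cuisine": "thai", "income": "high"},
             {"restaurants": {"cuisine"}})
ticket = psp.log_on("alice", "restaurants")
shown = psp.disclose(ticket, "restaurants")
```

In this example the restaurant Service receives only the eating preference, and a ticket presented by any other Service is refused.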

Note that for the sake of simplicity, Contract Broker is not shown in this and
further figures; it can be thought of as if its responsibilities were delegated to
Profile Broker.

Fig. 8. Central Privacy Service Provider



4.3 Trusted Mobile Terminal 

It seems possible both to run identity translation and to store profiles on the 
terminals, even on mobile terminals, as Fig. 9 illustrates. The fact that users 
can carry all the privacy-sensitive information physically with them everywhere 
they go would give a strong feeling of privacy. Note that the emphasis is on the 
terminal in the figure. The Authenticator may or may not be a separate party 
from the provider of the Service. 

Fig. 9. Trusted Mobile Terminal 




Fig. 10. Physical separation of identity and profile 



4.4 Physical Separation of Identity and Profile 

By handing profile storage and identity translation to different parties, we can 
come up with a highly distributed architecture, as Fig. 10 shows. 

This solution would be advantageous for users with relatively dumb terminals 
or for users preferring not to have private information with them (e.g. they 
frequently switch between using a mobile terminal and a PC for accessing the 
Internet). The Service can get access to the user’s profile by first obtaining an 
access ticket from the Client (3) and then sending it to the Profile Broker (4), 
which then discloses the profile (5). 

5 Conclusions 

We presented a conceptual framework for designing privacy enhancing architec- 
tures for personalized services. Following the principle of separating identity and 
profile, we identified the critical functional elements such as Identity Broker, Pro- 
file Broker, Contract Broker and Authenticator. Then, using these conceptual 
entities as building blocks, we showed how they appear in different architecture 
solutions. 


Abstract. In this paper we propose a method to prevent so-called 
“intersection attacks” on anonymity services. Intersection attacks are 
possible if not all users of such a service are active all the time and part 
of the transferred messages are linkable. Especially in real systems, the 
group of users (anonymity set) will change over time due to online and 
off-line periods. 

Our proposed solution is to send pregenerated dummy messages to the 
communication partner (e.g. the web server), during the user’s off-line 
periods. 

For a detailed description of our method we assume a cascade of Chau- 
mian MIXes as anonymity service and respect and fulfill the MIX at- 
tacker model. 



1 Introduction 

1.1 Overview 

Starting with a description of the attacker model and the nature of intersection 
attacks in section 1, we present the problem and motivation. 

Section 2 lists the basic approaches to prevent these attacks, and from section 
2.2 on we concentrate on the dummy traffic approach and different ways to 
deliver dummy messages. A summary of the requirements for a system that 
resists a global attacker concludes this section. 

In section 3 we present a system that fulfills these requirements but does so 
with a high demand on resources and only for a limited set of applications. 

Section 4 summarizes the results and points out areas where further research 
is needed. 

1.2 Motivation 

Anonymity on the Internet has been an issue ever since its focus shifted from 
scientific to social and commercial uses. Without special precautions most users 
and their actions on the Internet can be observed. To address this threat for 
privacy, various anonymizing services have been built, but especially real time 
services are vulnerable to traffic analysis attacks 0. 

In this paper we present methods of sending and storing dummy traffic to 
prevent such attacks. 

1.3 Mixes 

For the most part we will consider the anonymizing service a black box. A set of 
input messages is received by that black box and transformed to a set of output 
messages. Each incoming message will be anonymous among the set of outgoing 
messages. 

For particular properties, like a determined delay of messages, we will assume 
a cascade of Chaumian Batch MIXes 0 or Web MIXes 

1.4 Terms and Definitions 

A sender or receiver is said to be active if he sent or received a message during 
a period of time. 

For each observed message an attacker may construct sets of possible senders 
and/or recipients, excluding users who at that time were off-line or idle or 
otherwise unable to send or receive that message. These sets will be called 
anonymity sets. 

For lack of a better word, session will be used to describe a series of messages 
that belong to a communication relation. Contrary to the usage in network 
literature, this relation may continue over several online/off-line periods. 

Sessions are considered active if at least one message that can be linked to 
that session was transmitted during a period of time. 

A session could, for instance, describe the messages that somebody sends to 
post messages to a political forum under a pseudonym. 

A different example would be the requests that a user makes to check a 
stock market service. Provided that the requests reflect that user’s portfolio in a 
sufficiently unique way, there doesn’t need to be a pseudonym to recognize him. 

1.5 Attacker Model for Intersection Attacks 

When talking about the security of MIXes, often the global attacker or adversary 
is presumed. To start an intersection attack, a somewhat weaker attacker is 
sufficient, but there are certain other conditions that have to be fulfilled for an 
intersection attack. 

Figure 1 shows the attacker’s essential areas of control. The darker areas are 
the minimum that an attacker needs to watch, while the lighter areas show the 
area that a global attacker can watch and manipulate. 

— The attacker is able to watch all user lines. 

If the attacker can access information that allows a mapping of IP addresses 
to users, he only needs to watch the entry point of the cascade instead of all 
user lines^. 

— He is also able to watch all server lines. 

In a MIX cascade there is only one exit line. Watching that line is sufficient 
to see all requests that are sent to a server through the cascade, especially 
when no end-to-end encryption is used. 



^ Presuming that no direct peer-to-peer connections are made, as they are in Crowds [?] networks. 

Fig. 1. Attacker model for intersection attacks 



— He controls up to m — 1 MIXes and a small number of users. 

Though the control over MIXes is not technically needed for an intersection 
attack, it is important to keep this in mind because the counter measures 
should withstand the global attacker model. 

— The attacker knows how long it takes for a message to go through all the 
MIXes, so he can infer when a message that leaves the cascade has been 
sent. 

This is certainly true for batch MIXes and Web MIXes. These parameters 
should even be known to every user. At the very least the attacker could send 
messages through the system himself to measure the current delay. 

— Messages belonging to a certain session have to have some property that 
allows an attacker to link them to each other. 

This is of course the most fundamental precondition. Otherwise each message 
would have an independent anonymity set, consisting of the messages that 
were processed by the MIX cascade in the same time period. 



1.6 Intersection Attack 

Intersection attacks can be launched on two different scales. They differ in the 
time an attacker may need to link a series of messages to a user and the criterion 
used to decide if a user is possibly the sender of a message. 

By watching the lines between each user and the cascade’s first MIX, an 
attacker can see when the user is active and when he is idle or off-line. The 
attacker also watches all messages M that leave the cascade. Some of these can 
be linked to previous messages and recognized as belonging to a session s. This 
reveals when a session is active or inactive. 



1.7 Short Term Attacks 

Consulting the information he gathered from watching the activity on the users’ 
lines, the attacker can create a set of possible senders U_m for each message m. 
Taking all messages that belong to a session s and intersecting their sets of 
possible senders removes every user who was not active during the transmission 
of all of these messages. This shrinks the set of suspects S, possibly until there 
is only one user left^. 

$$S = \bigcap_{i=0}^{n} U_{m_i}$$

If users are not active all of the time, the probability for any non-involved 
user to be sending messages whenever a message for s is sent, rapidly decreases 
over time. Even if each user is active 90% of the time, the anonymity set will be 
down to roughly a third after just 10 messages of s. 
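
The figure of roughly a third can be checked directly. The sketch below assumes that users are active independently of each other and of the session, which the text implies but does not state; the function name is made up for the example.

```python
def surviving_fraction(p_active: float, k_messages: int) -> float:
    """Expected fraction of non-involved users that are still in the
    intersection after k linked messages, assuming each of them is
    independently active with probability p_active at each message."""
    return p_active ** k_messages


print(round(surviving_fraction(0.9, 10), 3))  # 0.349
```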

While the mathematical concept of intersections is nice for explaining this 
attack, it does not reflect the reality. An attacker who used plain intersection 
on the sets of possible senders could be fooled into removing the real user by 
a single message that he erroneously linked to a session while the session’s real 
user was inactive. 
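
The fragility of plain intersection can be seen in a small example. The counting variant shown next to it is not taken from the paper; it is just one plausible way an attacker could tolerate a mislinked message, and the user names and observations are invented.

```python
from collections import Counter


def strict_intersection(sender_sets):
    """Plain intersection of the possible-sender sets of all linked messages."""
    suspects = set(sender_sets[0])
    for users in sender_sets[1:]:
        suspects &= set(users)
    return suspects


def ranked_suspects(sender_sets):
    """Count how often each user was a possible sender (tolerates mislinks)."""
    counts = Counter(u for users in sender_sets for u in set(users))
    return counts.most_common()


# Hypothetical observations: "alice" owns the session, but the last
# message was erroneously linked to it while she was inactive.
observations = [
    {"alice", "bob", "carol"},
    {"alice", "carol"},
    {"alice", "bob"},
    {"alice", "dave"},
    {"bob", "carol"},
]
print(strict_intersection(observations))  # set()
print(ranked_suspects(observations)[0])   # ('alice', 4)
```

A single wrong link empties the strict intersection, while the frequency count still ranks the real user first.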



1.8 Long Term Attacks 

Long term attacks are possible if messages can be linked across more than one 
online period. In long term attacks the criterion for including a user in the 
anonymity set of a message isn’t user activity anymore, but session activity. All 
online users are added to the set of possible senders. In this case an attacker 
doesn’t need to accurately measure latencies and he doesn’t need to watch ac- 
tivity closely. 

The user sets of those periods will probably be very different and therefore 
the intersection attack will become more likely to succeed. 

2 Counter Measures 

In this section we present the two basic approaches to preventing intersection 
attacks. 



2.1 Preventing Linkability 

The most effective way to prevent intersection attacks would be to make 
messages unlinkable. Unfortunately this is not that easy. 

Linking by traffic data can be prevented to a certain degree. Message sizes 
can be padded to a common multiple and larger messages can be split into 
smaller fragments, but already at this point the number of transferred packets 
that enter and leave the anonymizing service can be correlated by an attacker 
who watches all input and output lines. Send and receive times are equalized 
by collecting messages to form a batch or by delaying each message for a 
user-selected time. This works for email, but for low latency services like web 
browsing there are rigorous limits to the delay that can be added. Longer delays 
will not be accepted by most users and therefore the anonymity group will shrink. 

^ Recognizing a group instead of a single user may be of interest, too, especially 
from the viewpoint of industrial espionage. But for the purpose of perspicuity, 
we will stick to the one-user scenario. 
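
The padding and fragmenting of message sizes mentioned in this subsection can be sketched as follows. The fragment size and the 4-byte length prefix are assumptions made for the example, not parameters of any particular MIX implementation.

```python
import os

BLOCK = 1000  # hypothetical fixed fragment size in bytes


def to_fragments(message: bytes, block: int = BLOCK) -> list[bytes]:
    """Split a message into equally sized fragments and pad the last one.
    A 4-byte length prefix lets the receiver strip the padding again."""
    framed = len(message).to_bytes(4, "big") + message
    padded_len = -(-len(framed) // block) * block  # round up to a multiple
    framed += os.urandom(padded_len - len(framed))
    return [framed[i:i + block] for i in range(0, padded_len, block)]


def from_fragments(fragments: list[bytes]) -> bytes:
    """Reassemble the fragments and remove the padding."""
    framed = b"".join(fragments)
    length = int.from_bytes(framed[:4], "big")
    return framed[4:4 + length]


frags = to_fragments(b"x" * 2500)
print(len(frags), {len(f) for f in frags})  # 3 {1000}
```

All fragments have the same size, but the number of fragments still depends on the message length, which is exactly the residual correlation the paragraph points out.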

At first glance, content-based linking seems easier to prevent. Simple 
end-to-end encryption would protect messages from eavesdroppers. If, however, 
the communication partner is not to be trusted or doesn’t even provide means 
of encrypted communication, the situation changes dramatically. 

An example of this is again web browsing. There are countless ways by which 
requests to web servers can be linked. And even if all traffic to and from the web 
server was encrypted, this does not protect from the server’s operator. Appendix 
o lists some of the methods that can be used to track users on the WWW. 

2.2 Simulating Activity 

If there is no way to make messages unlinkable, the obvious solution is to make 
all users potential senders or receivers. This can be done by sending dummy 
traffic. 



Short Term: Against short term attacks it is sufficient to transmit dummy 
messages between the users and the anonymity service. The dummy traffic 
conceals periods of inactivity by contributing messages when no real traffic is 
to be sent. This ensures that each user is in the anonymity set of each message 
for an active session^. 

If only one MIX is trustworthy, we have to assume the worst case, that it is 
the last one. Hence the dummy traffic has to be sent through the whole cascade 
and can safely be dropped by the last MIX. It does no harm if the dummy 
traffic passes through the decent MIX, and is later discovered by the attacker to 
be just a dummy, because the trustworthy MIX will stop any attempt to trace 
the dummy back to its origin. 



Long Term: Long term attacks can’t be overcome by dummy traffic to the 
MIX cascade. The attacker can afford to include all online users of the anonymity 
service and it is very hard to hide the fact of being online and using an anonymity 
service. 

Therefore, instead of simulating constant user activity, the session will have 
to appear active all the time, even while the session’s real user is off-line. 

The dummy traffic mentioned from here on, will have to look like normal 
user-generated traffic and consequently will have to be sent, like normal traffic, 
all the way to the server. 

^ But even if the number of users stays more or less constant, the anonymity set for 
the whole session gradually shrinks. As the session continues, some users will leave 
the system and others will join it. These users can’t be responsible for an ongoing 
session. 

The generation of such dummies will be discussed later. For now it is assumed 
that dummies are generated by the real user to be sent later by somebody else, 
while the real user is not online. 



2.3 Building a Global Anonymity Set 

Making all honest users appear equally likely to be the owner of a session is the 
ultimate goal in terms of the anonymity set. For real-time services like HTTP, 
this requires constant activity of all users, and therefore constant connectivity. 
Otherwise the first message that belongs to a session s, also specifies the starting 
anonymity set Mg, consisting of all users who are currently online and active. 
This will give the attacker a substantially smaller set of suspects to start with. 

If some restrictions on the real-time property can be accepted, there is a way 
to expand the limits of that initial anonymity set MJ. Once, this is achieved, 
other measures can be used to keep the anonymity set at that size. 

To overcome the initial limit for the anonymity set, the traffic for a session 
would have to start before the real user even came online and used the session for 
the first time. Generating a series of plausible dummy requests without having 
visited a web site before may look difficult, but HTTP-to-Email services like 
www4mail^ or agora^ could easily be adapted to use non-real-time channels like 
Mixmasters [?]. Using them to acquire a document from the WWW would be the 
first step to generate a set of dummy requests. These requests could be sent like 
normal dummies (as described below) and would serve as a fore-run to mask the 
real beginning of usage. The average fore-run period should be long enough for 
all users to be online at least once, so that each of them could have initiated the 
session. 



2.4 Maintaining the Anonymity Set 

An almost completely self-reliant solution could be to send long-running MIX 
messages, not unlike emails in Stop-and-Go MIXes [?]. But simply extending the 
anonymizing service to also handle “slow” traffic would not be enough. The 
dummy would have to be indistinguishable from normal traffic, at least after it 
passed the one trustworthy MIX. In this situation, the only position where a 
MIX could protect the nature of a dummy would be as the very last MIX in the 
cascade. 

Also, for Web MIXes and HTTP access, there needs to be a bidirectional 
connection, not just a one-way delivery. It seems necessary to delegate the 
actual sending of the dummy to some other entity. Dummies therefore have to be 
transported from the real user to whoever is intended to send them later. If 
an attacker observes a dummy somewhere during that transfer, he will be able 
to recognize it when it is sent through the MIX cascade later. If the transferred 
dummy was already encrypted for sending, an attacker wouldn’t learn anything 
about its plain text, but he could recognize it when it is sent to the MIX cascade 
and see by whom. The attacker could deduce that this user was currently idle 
and remove him from the anonymity set of each message that was sent at the 
same time. Therefore dummies should not be transported in plain text, and even 
encrypted dummies need to be hidden from an attacker if possible. 

^ http://www4mail.org/ 

^ http://www.eng.dmu.ac.uk/Agora/Help.txt 

The attacker model puts users in a grey area. A few of them may be in 
collusion with the attacker while most of them are trustworthy. With respect to 
this, it will be considered safe if a single user, or a very small group of users, 
learns about the nature or even the content of a dummy message. It will be 
considered insecure, however, if a larger group, e.g. the whole anonymity set of 
a message, could find out that a message was just a dummy. 

Dummy Sending Service. The first simple approach is a dummy sending 
service, that is informed by a user who wants to protect his session s. That 
dummy service will have to be provided with enough information to produce, or 
just forward, dummy messages for a session. 

To avoid flooding of that service, without identification of its users, blindly 
signed tickets|3 for dummy traffic can be issued by that service. Also, dummy 
sending could be paid for in advance with an anonymous payment system. 

Any communication with the dummy service will have to be asynchronous. 
Otherwise the anonymity service itself will know when the real user of s is 
online, and will therefore be able to start an intersection attack by itself. 

For services with a small number of randomly distributed requests per day, 
this service can protect the anonymity, but only as long as the number of requests 
that the real user adds to a session during his online period is not statistically 
significant. 

For services with more traffic, there has to be a way to regulate the flow of 
dummies when the real user wants to use the session. Otherwise the real user’s 
messages that belong to s will just add to the dummy traffic and clearly mark 
the online period of its sender. 

Even a widely variable amount of dummy traffic will not help much. Given 
enough traffic/time pairs, the attacker will always be able to find out the real 
user. Comparing the average traffic of s during each user’s online and off-line 
periods will reveal the source of that additional traffic. The user with the highest 
difference in those averages is probably the real user of s. 
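
The comparison of averages described above can be sketched as follows. The data layout (one traffic count per observation period and one online flag per user and period) and all values are invented for illustration.

```python
def rank_by_traffic_difference(session_traffic, online):
    """session_traffic: per-period message counts observed for session s.
    online: dict mapping each user to a list of booleans, True if the
    user was online in that period.
    Returns (user, avg_online - avg_offline) pairs, largest first."""
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    scores = {}
    for user, flags in online.items():
        on = [t for t, f in zip(session_traffic, flags) if f]
        off = [t for t, f in zip(session_traffic, flags) if not f]
        scores[user] = mean(on) - mean(off)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical observations: the real user's requests add to the dummies.
traffic = [10, 14, 9, 15, 11, 16]
online = {
    "alice": [False, True, False, True, False, True],
    "bob":   [True, True, False, False, True, True],
    "carol": [True, False, True, False, True, False],
}
print(rank_by_traffic_difference(traffic, online)[0])  # ('alice', 5.0)
```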

Telling the service to pause sending dummies for s is highly dangerous. Even 
if communication to the service is totally anonymous, it will deliver half the data 
that is necessary for an intersection attack directly to that service. 



Sliding Storage. To avoid trusting a single server with the storage of dummies 
and to improve the chances to revoke a dummy, a slow message forwarding 
system was designed. In this system a message doesn’t stay at one server but is 
repeatedly handed around. 

It consists of several storage nodes S that roughly resemble a network of 
Stop-and-Go MIXes. In contrast to MIXes, which push messages to the next MIX, 
storage nodes show a pull behavior. Messages are requested by the next storage 
node instead of being sent to it automatically after a specified time. Those 
requests are only answered if the asking party can prove some knowledge of the 
message, but without revealing the identity of the requesting party. To achieve 
this, communication between the nodes could be run over a normal MIX cascade. 

The path of each message through the set of storage nodes S_0, ..., S_n is chosen 
by its creator, and the message is encrypted with the public key of each node 
along the way. As in a normal MIX network, the message is encrypted for the 
last node first. 

Every node S_i along the planned route gets a short note with the information 
that it needs to request the message from its predecessor S_{i-1}. This includes 
the time at which to request the message as well as the message’s hash value, 
to prove a request’s validity, and of course the address of the predecessor. The 
delivery of these notes could be left to some asynchronous transport system to 
ensure that no node gets the request information too far ahead of its time. 

While a message is stored at S_j, the creator of that message has the chance 
to intercept it before it reaches its goal. Knowing the current position and hash 
value for each message that he sent, he could act as if he were node S_{j+1} and 
simply request it. This empowers the creator to revoke his own dummies without 
revealing his identity. 
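
The pull behavior and the revocation trick can be illustrated with a toy model. It deliberately leaves out everything that makes the real scheme hard: there is no layered encryption (so the hash stays the same at every hop), no timing, and no anonymous channel. SHA-256 and all names are assumptions made for the example.

```python
import hashlib


class StorageNode:
    """Toy storage node: hands a stored message to whoever proves
    knowledge of its hash, then forgets it."""

    def __init__(self):
        self._cells = {}

    def store(self, message: bytes) -> str:
        digest = hashlib.sha256(message).hexdigest()
        self._cells[digest] = message
        return digest

    def pull(self, digest: str):
        # Returns None if the message is unknown or was already pulled.
        return self._cells.pop(digest, None)


def forward(predecessor: StorageNode, successor: StorageNode, digest: str) -> bool:
    """The successor requests the message from its predecessor."""
    message = predecessor.pull(digest)
    if message is None:
        return False
    successor.store(message)
    return True


s_j, s_next = StorageNode(), StorageNode()
digest = s_j.store(b"dummy request")
revoked = s_j.pull(digest)            # the creator acts like the next node
print(revoked == b"dummy request")    # True
print(forward(s_j, s_next, digest))   # False: nothing left to hand over
```

From the point of view of the storing node, the creator's request is indistinguishable from the successor's, which is why a later request from the real successor finds nothing.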

To avoid suspicion at S_j and S_{j+1}, repeated requests should be common, as 
should request chains that are longer than the actual message path. 

Though this system seems to be fairly safe against attacks from outsiders 
and single nodes, there are problems when nodes collaborate. 

Adjacent nodes along the path could inform each other about requests that 
they sent/received and by this, could detect the revoking of messages. 

A malicious node could also refuse to delete an already requested message and 
hand it out if it is requested more than once, though the user could detect this 
by sending requests to nodes that his message should already have passed. 
Trustworthy nodes could also do this check by requesting a message more than 
once. But if such behavior goes unnoticed, the message will reach its goal. This 
will have an effect on the average number of messages as shown before. 

If S_j and S_{j+1} were both corrupt, they could infer from their combined 
knowledge that the user intercepted a message. Hence, no two adjacent nodes may 
be corrupt. This is of course a much stronger assumption than the original MIX 
model where only one trustworthy node is needed in the whole message path. 

2.5 Requirements 

Summarizing the results from examining these approaches, here are the general 
requirements for a system that stores and sends dummy messages. 

— Distributed trust. 

Assuming the global attacker model, the adversary controls most (in the 
extreme case all but one) server nodes and a small part of the client nodes. 
To build a trustworthy system with these components we have to distribute 
our trust. This could be done using protocols that guarantee the security by 
involving at least one trustworthy node in all security-critical computations. 
A different approach is to limit the number of involved users and estimate 
the risk. 

— Limited DoS capabilities. 

Denial of Service attacks cannot be prevented when the cooperation of 
possibly corrupt nodes is needed to achieve a result. The best thing we can 
hope for is to expose the uncooperative nodes. On the other hand, a node 
that intentionally doesn’t answer is indistinguishable from a node that fell 
victim to an active attack. If cooperation is not an issue, the effect of DoS 
attacks on the system should be confined. Flooding attacks in particular should 
be prevented, where the attacker tries to maximize his share of resources in 
order to simulate a larger anonymity group or to isolate other users. 

— Traffic shaping. 

The traffic that an observer sees for a certain session should be independent 
of the user’s online periods. This means not only that dummy traffic 
should be sent during the user’s off-line periods but also that the amount 
of traffic should not be higher during his online periods. A user therefore 
should be able to revoke his dummies or replace dummy messages with real 
messages without an attacker noticing it. 

3 Limited Solution 

We propose a system for transporting dummies that should fulfill the previously 
listed requirements. 

It is limited to simple non-interactive information retrieval. Access to web 
pages via basic HTTP authentication is possible, but any challenge-response 
based access control will fail^. 

The bandwidth demand is rather high, as is the number of asymmetric 
cryptographic operations that have to be performed^. 

3.1 Passive Dummies 

To handle the full variety of tracking methods, dummy requests will have to be 
more than just plain HTTP requests. For some tracking methods however, sim- 
ple, even repeated, requests will be adequate. These passive dummies just contain 
the requested URL, and all HTTP headers, including cookies, web browser iden- 
tification and every other header that the real user would have sent himself. If 
a dynamic page is requested via the HTTP “POST” method, the data that the 
real user would have submitted, also has to be part of the dummy. 

Passive dummies could be automatically generated by the local proxy with 
very little help from the user by recording repeated requests and asking the user 
to approve their usage as dummies. 

^ This is no limitation of the transport system though, but a limitation of the trust 
that one is willing to put into a randomly selected user. 

^ Both can be reduced to a certain degree as shown in appendix [01 

The dummy could consist of several prepared requests, meant to be transmitted 
one after another to emulate the automated loading of embedded images and 
objects. This is of course only possible if those responses are known in advance. 
Intermediate changes to the requested page may result in requests for images 
that are not embedded anymore, but this is not that dangerous. Some browsers 
already use a cached version when re-requesting a document to also re-request 
embedded images even before the updated version of the document is returned. 



Prefabricated Dummies are dummies that are built and prepared for sending 
by the real user. This means they are already encrypted. Like the long-running 
dummy messages, they are appealing since they don’t seem to need much 
faith in systems outside the user’s trusted region^. The contents of those 
dummies are only revealed after they have passed the cascade and thereby at 
least one trustworthy MIX. 

The considerable advantage is that since the dummy sender doesn’t need to 
react to the server’s response, he doesn’t need to understand it. So the returned 
data doesn’t need to be intelligible to him and he doesn’t learn what he requested 
at all. 



3.2 One Way Dummy Database 

The dummies are stored in a distributed database. It works similarly to the 
message service from [?], but unlike there, the user requests data from only a 
single address and the data in each cell is scrambled. Only the XOR-combined 
result of each cell’s contents at the address I from all n database nodes will 
reveal the stored data. Instead of hiding the address that is accessed from the 
database servers, the data itself is hidden. 

To make sure that no dummy is given to both an honest user and a rogue one, 
the database nodes are obligated to issue each cell’s content to only one user. 
Due to the XOR scrambling it only takes one trustworthy database to enforce 
that rule. 

Building data fragments for a system with n-nodes is done by generating 
n — 1 blocks of random data and XOR-combining all those blocks and the real 
data, to form the last block. 
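
The construction of the fragments can be sketched as follows. The function names are illustrative; the scheme itself is the plain XOR splitting described above, in which any n - 1 fragments look like random data.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equally long byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split(data: bytes, n: int) -> list[bytes]:
    """Create n fragments: n - 1 random blocks plus one block that is the
    XOR of all random blocks and the real data."""
    if n < 2:
        raise ValueError("need at least two database nodes")
    blocks = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    last = reduce(xor_bytes, blocks, data)
    return blocks + [last]


def combine(fragments: list[bytes]) -> bytes:
    """XOR all n fragments to recover the stored data."""
    return reduce(xor_bytes, fragments)


dummy = b"GET /index.html HTTP/1.0"
fragments = split(dummy, 4)          # one fragment per database node
print(combine(fragments) == dummy)   # True
```

Because every fragment is needed for reconstruction, a single node that refuses to hand its fragment to a second requester is enough to keep that requester from learning the dummy.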

Presuming the attacker’s presence in most parts of the system it seems best 
to give the dummy to just one randomly selected user. If flooding attacks can 
be prevented the chances of giving it to a corrupt user then would be minimal. 

Having the user request dummies, instead of sending them, makes more sense 
in terms of performance, though. The user (or his local proxy) knows more about 
his current bandwidth demands and wouldn’t send dummies if he needed all his 
bandwidth for himself, anyway. Hence, instead of selecting the user randomly, the 
dummies are encoded so that they all look like random patterns until requested. 

^ Except for sending them at an appropriate time, which could be tested by submitting 
some dummies whose sending can be checked by the user. 

Traffic Shaping. Each database node publishes a list of cell addresses and hash 
values computed from each cell’s contents. Normal users randomly select one of 
these addresses from that list and request the contents of that address from each 
database node. The real user can recognize his own dummy by the hash values 
of each fragment and request that dummy. This allows for “revoking” of one’s 
own dummies and solves the problem that dummy traffic and real traffic add up 
during the real user’s online period. 
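
The selection rule can be sketched for a single database node as follows. The hash function (SHA-256), the dictionary layout of the published list and the policy of always preferring one's own dummy are assumptions made for the example.

```python
import hashlib
import random


def fragment_hash(fragment: bytes) -> str:
    return hashlib.sha256(fragment).hexdigest()


def choose_address(published, own_hashes, rng=random):
    """published: dict address -> hash, as listed by one database node.
    own_hashes: hashes of the fragments this user deposited at that node.
    Returns (address, is_own)."""
    own = [addr for addr, h in published.items() if h in own_hashes]
    if own:
        # The real user picks up ("revokes") one of his own dummies.
        return own[0], True
    # Every other user picks a random cell from the list.
    return rng.choice(sorted(published)), False


published = {0: fragment_hash(b"a"), 1: fragment_hash(b"b"), 2: fragment_hash(b"c")}
print(choose_address(published, {fragment_hash(b"b")}))  # (1, True)
```

To an observer both cases look like an ordinary request for one listed address, so revoking a dummy does not add traffic to the session.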



Preventing Flooding. Accessing the database is done through an anonymous 
channel, like a MIX cascade, to hide the depositing and withdrawing user from 
the database nodes and other observers. 

Since all database nodes should give the fragments of one dummy to the same 
user, there has to be some pseudonym, by which the databases can recognize him. 
Using blindly signed tickets, that are only valid for one deposition or withdrawal, 
makes the actions unlinkable and prevents correlation between depositing and 
withdrawing users. 

A ticket contains a public key of an asymmetric cryptographic system that 
will be used to verify the ticket holder’s signature in a later message^. 

To create such a ticket the user first generates a normal asymmetric key pair. 
The public key is extended by some redundancy, blinded and sent to each node. 
The signed replies are unblinded, and the public key with the redundancy 
and all signatures are combined to form the ticket. 

If this ticket is later used to sign a request, the database node tests its own 
signature of that ticket and the ticket’s signature of the whole request. 
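
The blinding and unblinding steps can be illustrated with textbook RSA. The paper does not say which blind signature scheme is meant, so RSA is an assumption here, and the tiny key below is insecure and serves only to make the arithmetic visible. The integer m stands for the ticket's public key with its redundancy, reduced to a number below N.

```python
import math
import secrets

# Toy RSA key of one database node (N = 61 * 53); illustration only.
N, E, D = 3233, 17, 2753


def blind(m: int):
    """User: hide m behind a random factor r that is invertible modulo N."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return (m * pow(r, E, N)) % N, r


def sign(blinded: int) -> int:
    """Database node: signs the blinded value without learning m."""
    return pow(blinded, D, N)


def unblind(blind_signature: int, r: int) -> int:
    """User: remove the blinding factor to get a signature on m itself."""
    return (blind_signature * pow(r, -1, N)) % N


def verify(m: int, signature: int) -> bool:
    """Anyone: check the node's signature on m."""
    return pow(signature, E, N) == m % N


m = 1234
blinded, r = blind(m)
signature = unblind(sign(blinded), r)
print(verify(m, signature))  # True
```

The node never sees m, so it cannot later connect the ticket it is shown with the signing request it answered. In the scheme above, the user would repeat this with every database node and attach all signatures to the ticket.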

Leaving it to the user to request dummies might open an opportunity for 
corrupt users to drain or flood the dummy database. But if each user is only 
given a limited number of tickets, the few corrupt users are not able to do 
significant harm. Concentrating on dummies by a certain user or dummies 
addressed to a certain service is not possible. 

Malicious users could try to render up to n dummies useless with one request, 
by requesting the content of different cells for different nodes. To avoid that, the 
database nodes will have to inform each other of requested addresses. 



Authentication. All communication to the user has to be signed by the data- 
base nodes in order to verify the message’s integrity and to empower the user to 
prove misbehavior of database nodes. 

Furthermore, all database nodes should continuously publish their log files so 
that their work can be reviewed by a third party in case of doubt. These log files 
will not contain any actual dummy data, only its hash values. 

All deposit and withdrawal messages that are sent to the database nodes 
contain a ticket and will be signed by the secret key that belongs to the used 
ticket. 

^ If a symmetric cryptographic system was used, each database node could use the 
key to impersonate the user and deceive the other nodes. 

When depositing a dummy, for example, the message to the server is 
encrypted with the server’s public key and signed by the key included in the ticket. 
Upon receiving a message, the server tests its own signature of the ticket and 
then the signature of the whole message, made with the ticket’s secret key. Then 
the server decrypts the message’s content. 

The server could sign its answer and send it back to the client, but as with
any other handshake, there is no finite protocol that will totally satisfy the needs
of both parties. Publishing the (signed) dummy list, however, can be seen as a
public statement of commitment. Therefore, at this point the server will include
the new dummy’s cell address and hash value in its list. The client can check
this by requesting the server’s dummy list (without any reference to that
particular dummy or ticket).
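
The check against the public list can be sketched as follows. This is only an illustration under stated assumptions: SHA-256 stands in for the unspecified hash function, the class and function names are hypothetical, and the signatures on the list and on the messages are omitted.

```python
import hashlib


def fragment_hash(fragment: bytes) -> str:
    # Assumption: SHA-256; the text only speaks of "hash values".
    return hashlib.sha256(fragment).hexdigest()


class DummyListServer:
    """Toy stand-in for one database node and its public dummy list."""

    def __init__(self):
        self.cells = {}  # cell address -> stored fragment

    def deposit(self, address: int, fragment: bytes) -> None:
        self.cells[address] = fragment

    def public_list(self) -> dict:
        # Only addresses and hash values are published, never the data itself.
        return {addr: fragment_hash(frag) for addr, frag in self.cells.items()}


def deposit_confirmed(public_list: dict, address: int, fragment: bytes) -> bool:
    """Client-side check: does the list commit to our fragment at our address?"""
    return public_list.get(address) == fragment_hash(fragment)
```

Because the client asks for the whole list rather than for one entry, the request itself reveals nothing about which dummy or ticket it is interested in.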

If a node refuses to accept a ticket, it has to show that this ticket was already 
used. To prove that, the node will have to keep all user messages as long as the 
used ticket would have been valid. 



Delayed Dummy Release. Assume that only the real user of s deposits
new dummies for s while he is online and that new dummies are made available
right away. As soon as he goes off-line and stops depositing dummies, the chances
of randomly picking a dummy for s will decrease over time. This is because the
number of available dummies will shrink with every one that is sent,
while other users deposit their dummies in the newly available cells.




Fig. 2. Changes in dummy traffic for s. 



As Fig. 2 shows, re-stocking the dummy supply for s will show up as a sharp
rise in the number of requests for s and could be used to select the online users
of such periods for an intersection attack.

There are several possible solutions. The deposition of dummies could be
made asynchronous, with dummies that take more or less time to get to the
database nodes. But this would disable the direct control mechanisms that the
public dummy list provides. A server could deny having received a message
and the user would have no proof against it. The use of receipts (sent to
an anonymous return address) instead of public lists could also mend this, but
would add even more delay.

The database nodes themselves could take care of dispersing the dummy
release over time. But they would have to be instructed about the time frame
for each dummy by its creator, since they themselves do not know which session
a certain dummy belongs to.

Instead of allowing an individual time frame for each dummy, it seems more 
efficient to globally define time frames and to only allow deposition of dummies 
for future time frames. 
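
A minimal sketch of this rule, assuming hypothetical fixed-length frames counted from a common epoch (the frame length and the function names are illustrative, not taken from the design):

```python
FRAME_SECONDS = 3600  # hypothetical global frame length


def frame_of(timestamp: float) -> int:
    """Index of the global time frame that contains the timestamp."""
    return int(timestamp // FRAME_SECONDS)


def may_deposit(target_frame: int, now: float) -> bool:
    """Deposits are accepted only for frames that have not started yet."""
    return target_frame > frame_of(now)


def is_released(dummy_frame: int, now: float) -> bool:
    """A dummy becomes available for withdrawal once its frame has begun."""
    return dummy_frame <= frame_of(now)
```

Since every dummy for a frame is deposited before that frame starts, the moment of deposition no longer coincides with the moment the dummy becomes available.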

Evaluation. 

— Performance: 

Table 1 shows the cost in messages and additional cryptographic operations.
Direct messages are assumed to already be authenticated, and indirect
messages are assumed to be encrypted with the receiver’s public key and sent
through a MIX cascade or some other form of anonymizing device. Sending
and receiving are counted alike, even though the cost of processing can
be very different, especially for indirect messages.

The last two columns contain the number of additional asymmetric cryp- 
tographic operations (key generation, signing, testing, encryption, decryp- 
tion) that are needed, apart from the ones for handling of messages. Less 
demanding operations, like splitting dummies, blinding tickets and comput- 
ing/comparing hashes, are not included. 

• Requests for depositing and withdrawing are sent to each server
separately. To stop users from requesting different cells on different servers,
the servers inform each other of requests. The messages for that can
easily be reduced to $2n - 1$.

• The requests for signing a ticket are made directly to each server, as is 
the request for the signed lists, to check deposition. 



Table 1. Resource demands 
Depending on the average frequency of dummy messages, the number of
messages for storing and retrieving 24h of dummy traffic for a session is
shown in Fig. 3. The number of database nodes is assumed to be 3.

Fig. 3. User messages to protect s for 24 hours.



— Security: 

The system seems fairly safe against passive attacks by nodes and small
numbers of users. Active attacks by nodes are limited to denial of service.
Delivering false data without risking detection is not possible unless a hash
collision can be found.

The usage of asymmetric cryptography in most parts of the system allows 
for delegation of messages in order to expose uncooperative nodes. If, for
example, a node refuses to deliver the content of a cell, the user could send his
request to a different node with an appeal to act as a proxy. The proxy node 
could test the validity of the request by checking the signature and consulting 
its own log files for a previous request for that cell. After establishing the 
validity, the proxy would forward the request and, if the other node remained 
uncooperative, ask it for some proof. A proof could be either the request by 
some other user for the same cell, or a request by the same ticket holder for 
a different cell. 

4 Conclusions 

In this paper, the presuppositions of intersection attacks were presented first
and a specialized attacker model was introduced. Several methods to prevent the
result of this attack have been examined.

Among these methods, “Sending dummy traffic to the server” was examined in
detail, and the results of this analysis have been incorporated in the design of a
limited solution. This solution provides strong anonymity even in the presence of
a global attacker. A distributed database is used to store pregenerated dummies,
where a single trustworthy server is able to uphold anonymity.

Another single server could stop the system from working, but the failing 
server can be identified by trustworthy nodes and a user without revealing the 
user’s identity. 

Even though no general solution was found, the discovered design principles,
like “involving every node in decisions” and “extensive public log files”, seem to
be important guidelines for further work.

Unfortunately, the solution is rather costly in terms of bandwidth and
computational resources. But the other examined methods either do not provide
enough security under the assumed attacker model or are too costly in terms of
bandwidth.

Relevant restrictions of other examined methods are: 

— Transferring dummies directly to each user who is supposed to form the 
anonymity set for one’s session, is too costly or even impossible if a large 
anonymity set is needed. 

— Delivering dummies through an additional “slow” MIX channel is not pos- 
sible for bidirectional traffic. 

— Leaving the sending or distribution of dummies to a central service is too
risky since it introduces a single point of trust.

— The “sliding storage” network leads in the right direction by distributing 
trust to several nodes. But the involvement of each node is limited and so is 
the protection a trustworthy node can provide. 

All but the last method suffer from the lack of revocability of dummies. 

For weaker adversaries, the solutions examined along the way certainly
provide security. But switching to a weaker attacker model makes the usage of
MIXes almost obsolete.

Further research should examine the possibilities to reduce the computational 
and I/O load on the server nodes as well as different kinds of dummy messages 
(like the ones mentioned in Appendix D). In particular, an extension to services
that use challenge-response protocols for authentication would be interesting.
A Example Dummy Database

To illustrate the process, we walk through a small example with just one user 
and two servers A and B. 




Fig. 4. Small example system. (Diagram labels: Server A; direct / authenticated
channels; indirect / anonymous channels.)



Request Ticket. After generating a key pair $(s_t, t_t)$, the user adds some
redundancy $r$ to the public key $t_t$, blinds it with $b$, and sends this signed
ticket request message for a certain time frame to each server.

\[ b(t_t, r),\ \mathit{frame},\ s_{user}\bigl(b(t_t, r),\ \mathit{frame}\bigr) \]

Each server checks whether the user is allowed any more tickets for that time
frame and, if so, signs the blinded request with its key for that time frame
and returns it to the user.

\[ s_{A_{frame}}(b(t_t, r)),\ \mathit{frame},\ s_A\bigl(s_{A_{frame}}(b(t_t, r)),\ \mathit{frame}\bigr) \]

\[ s_{B_{frame}}(b(t_t, r)),\ \mathit{frame},\ s_B\bigl(s_{B_{frame}}(b(t_t, r)),\ \mathit{frame}\bigr) \]

The user unblinds the signed key with $b^{-1}$ and is now in possession of a valid
ticket.

\[ (t_t, r),\ s_{A_{frame}}(t_t, r),\ s_{B_{frame}}(t_t, r) \]



Deposit Dummy. The user generates the fragments $F_A$, $F_B$ as described above
and additionally computes their hash values. Each fragment is encrypted with
the corresponding server’s public key and is concatenated with the address $a$
and the ticket. This whole message is signed with the ticket’s secret key and sent
to the server through an anonymous channel.

\[ c_A(F_A),\ a,\ (t_t, r),\ s_{A_{frame}}(t_t, r),\ s_t\bigl(c_A(F_A),\ a,\ (t_t, r),\ s_{A_{frame}}(t_t, r)\bigr) \]

\[ c_B(F_B),\ a,\ (t_t, r),\ s_{B_{frame}}(t_t, r),\ s_t\bigl(c_B(F_B),\ a,\ (t_t, r),\ s_{B_{frame}}(t_t, r)\bigr) \]

The servers check if the cell at the given address is available, test their own 
signature of the ticket, test the ticket’s signature over the whole message, and 
finally send each other confirmation messages to make sure that the same address 
was selected by the same ticket holder at both servers. 

If all went well, they decrypt the fragment, store it at the given address and 
compute the fragment’s hash value to update their public list. 

Request List. The user simply sends a request for the list to each server and
gets a signed reply that contains the list of addresses and hash values. By
comparing the hash values to the ones that he computed himself, he can now make
sure that his own dummies have been stored. Of course the user would test the
server’s signature on the list as well.

Withdraw Dummy. With another key pair $(c_t, d_t)$ the user requests a ticket
and eventually receives it.

\[ (c_t, r),\ s_{A_{frame}}(c_t, r),\ s_{B_{frame}}(c_t, r) \]

After reading the public list, the user decides to withdraw the dummy at
the address $a'$. He builds withdrawal messages that contain the address and the
ticket and are signed with the ticket’s secret key $d_t$. These messages are sent
to the servers through bidirectional anonymous channels.

\[ a',\ (c_t, r),\ s_{A_{frame}}(c_t, r),\ d_t\bigl(a',\ (c_t, r),\ s_{A_{frame}}(c_t, r)\bigr) \]

\[ a',\ (c_t, r),\ s_{B_{frame}}(c_t, r),\ d_t\bigl(a',\ (c_t, r),\ s_{B_{frame}}(c_t, r)\bigr) \]

The servers check if there is data available at the given address, test their own 
signature of the ticket, test the ticket’s signature over the whole message, and 
again send each other confirmation messages to make sure that the same address 
was selected by the same ticket holder at both servers. 

If all went well, they take the fragment, encrypt it with the key $c_t$, sign it
with their own secret key and return it through the anonymous channel to the
ticket holder. For server A this reply is:

\[ c_t(F'_A),\ a',\ s_A\bigl(c_t(F'_A),\ a'\bigr) \]

For each returned message, the user tests the server’s signature and decrypts the
fragment. After making sure that the hash value for address $a'$ from the server’s
list matches the actual data, he combines the fragments and obtains the real
data.
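
The fragment handling in this walk-through can be illustrated with a short sketch. It assumes that fragments are produced by XOR splitting (Appendix B speaks of an XOR combined result) and that SHA-256 stands in for the unspecified hash function; the function names are hypothetical, and encryption, signatures and tickets are left out.

```python
import hashlib
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split_xor(data: bytes, n: int) -> list:
    """Split data into n fragments whose XOR equals the original data."""
    fragments = [secrets.token_bytes(len(data)) for _ in range(n - 1)]
    last = data
    for frag in fragments:
        last = xor_bytes(last, frag)
    return fragments + [last]


def verify_and_combine(fragments: list, published_hashes: list) -> bytes:
    """Check each fragment against the public list, then recombine."""
    for frag, expected in zip(fragments, published_hashes):
        if hashlib.sha256(frag).hexdigest() != expected:
            raise ValueError("fragment does not match the published hash")
    result = bytes(len(fragments[0]))
    for frag in fragments:
        result = xor_bytes(result, frag)
    return result
```

With two servers, `split_xor(dummy, 2)` yields the fragments for A and B; neither fragment alone reveals anything about the dummy, and a server that returns altered data is caught by the hash comparison.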

B Optimizations 

Servers could be trusted to forward messages on behalf of the user when possible.
If errors occur in such a system, the user will simply fall back to connecting
to each server individually to find out what went wrong.
In some special cases, this slightly increases the number of messages for
some servers, but it reduces the number that a user needs to send and receive.

— The request for signing a ticket can be forwarded by one server to the next 
and sent back to the user by the last one with the collected signatures. 
Since the request is rather small and the signatures don’t grow the message 
that much, it is considered better to forward the whole request instead of 
contacting each server separately. 

— When withdrawing a dummy, one server could be asked to collect all frag- 
ments and return the XOR combined result. To stop that server from seeing 
the actual dummy data, each node would not encrypt the fragment with the 
ticket’s public key, but combine the fragment with a pseudo-random stream 
of the same length and return that masked fragment and the initialization 
parameters for the random number generator in an encrypted block. 

The collecting server would XOR combine all fragments and append the 
encrypted blocks. Instead of returning n times the fragment size, only one 
fragment is returned and n blocks of the asymmetric crypto system. 

Upon receiving this message the user decrypts the n asymmetric blocks and 
extracts the initialization values for the random number generators. By over- 
laying all n pseudo random streams the real data is restored. 

This optimization removes the ability to prove misbehavior of nodes, but if
errors occur, it is possible to fall back to requesting fragments from each
server individually.

— A request to deposit a dummy could be sent to one randomly selected server,
containing all fragments, each of them encrypted with the respective server’s
public key. The selected server would split this large message into smaller
parts and forward those parts to the other servers.

These smaller parts would have arrived from an unknown source at the
servers anyway, and are verified by the layers of cryptography that enclose
their content, not by their apparent sender.
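
The masked collection described in the second optimization above can be sketched as follows. The sketch is illustrative only: SHAKE-256 stands in for the unspecified pseudo-random generator, the function names are hypothetical, and the asymmetric encryption of the initialization parameters (here called seeds) with the ticket's public key is omitted.

```python
import hashlib
import secrets


def stream(seed: bytes, length: int) -> bytes:
    """Pseudo-random stream derived from a seed (SHAKE-256 as a stand-in)."""
    return hashlib.shake_256(seed).digest(length)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def node_reply(fragment: bytes):
    """A node masks its fragment and reports the seed it used."""
    seed = secrets.token_bytes(16)
    return xor_bytes(fragment, stream(seed, len(fragment))), seed


def collect(replies):
    """The collecting server XORs the masked fragments and appends the seeds."""
    masked, seeds = zip(*replies)
    combined = masked[0]
    for m in masked[1:]:
        combined = xor_bytes(combined, m)
    return combined, list(seeds)


def user_unmask(combined: bytes, seeds) -> bytes:
    """The user overlays all pseudo-random streams to restore the data."""
    for seed in seeds:
        combined = xor_bytes(combined, stream(seed, len(combined)))
    return combined
```

The collector only ever handles masked fragments; in the described design it could not read the seeds either, because each would travel inside a block encrypted for the ticket holder.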



C Tracking Methods on the WWW 

Under normal circumstances the task of an anonymizing service is to prohibit 
the linking of requests by removing as much “dangerous” information from them 
as possible. For some applications, however, it is essential that messages can be
linked for the duration of a session or even longer. Here are some examples of 
tracking methods and sources of information that can be used to link requests 
over a long time. 

— The requested URL may contain an ID in several places: the host name, the
path, the filename, or a CGI parameter behind the filename. The ID is
carried over from request to request, either by being included in every link
in the HTML document, or by being added to the URL by the browser
when requesting some relative URL. Since these IDs are hardly ever human
readable, they only tend to stick to a user if he puts a URL with an embedded
ID in his bookmarks.

— A Cookie is an HTTP header that contains a name/value pair. It is initially
sent by the server to the browser and is sent back to the server with every
further request. It can contain anything from a session ID to the user name and
password for that server.

— The HTTP authentication header is of course designed to authenticate,
and in most cases identify, a user. As a tracking method it is very reliable
but hardly used since, unlike cookies, the user name and password are not saved
to disk. It therefore requires the user to enter a user name and a password
again and again.

— Repeated requests of the same static URL can also be linked, even if no
special tracking methods are used. The mere fact that this URL is rarely
requested by anybody else is sufficient. The same is true for repeated requests
to dynamic pages, but dynamic pages generally tend to be requested more
frequently. A news site, for example, will often have a few dynamic entry pages
with the headlines that are requested very often, and many static pages that
contain the actual articles and are requested comparatively rarely.

D Active Dummies 

A different approach to dummy traffic is to give out not complete requests, but
instructions on how to build dummy requests and, even more important, what
to do with the responses.

These instructions could be quite simple, like a script that loads an HTML
document and follows three randomly selected links from that document. They 
could include cookies that should be sent along with requests and instructions 
on how to handle received cookies and other received data. 

Even HTTP authentication data could be part of these instructions, although
giving this kind of data to some unknown user is only advisable if no
damage can be caused by that user.

Giving access to web-based email services like Hotmail or GMX in an
active dummy could result in loss of data, because the sender could extract the
authentication data from the active dummy’s instructions and use it to delete
all incoming email or even to send out emails himself. Accessing an information
service that requires users to register may be a more suitable application.

Links to follow from the received document, could be selected with a regular 
expression or some other pattern description. This would allow for embedded 
session IDs while narrowing down the set of possible links. On a news web site 
for example, this could be used to select links to political or scientific articles if 
these categories are part of the link’s URL or if they are always in a certain part 
of the HTML document. 
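
A sketch of such a link selection, using only the Python standard library. The function and class names, the sample pattern, and the choice of three links per document (taken from the script example above) are illustrative assumptions, not a specification of active dummies.

```python
import random
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href values of all anchor tags in a document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def select_links(html: str, pattern: str, count: int = 3, rng=None) -> list:
    """Pick up to `count` random links whose URL matches the pattern."""
    rng = rng or random.Random()
    collector = LinkCollector()
    collector.feed(html)
    matching = [link for link in collector.links if re.search(pattern, link)]
    return rng.sample(matching, min(count, len(matching)))
```

A pattern such as `^/(politics|science)/` restricts the dummy to two categories while leaving any session ID in the query string untouched, since the matched link is followed exactly as it appears in the document.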

The examples show that active dummies are not that easy to produce and 
will have to be tailored to each web site to produce some reasonable results. A 
change of layout or structure on a web site could make them unusable instantly. 

Hotmail: http://www.hotmail.com/
GMX: http://www.gmx.net/
Abstract. The dramatic growth of services and information on the
Internet is accompanied by growing concerns over privacy. Trust
negotiation is a new approach to establishing trust between strangers on the
Internet through the bilateral exchange of digital credentials, the on-line
analogue to the paper credentials people carry in their wallets today.

When a credential contains sensitive information, its disclosure is
governed by an access control policy that specifies credentials that must be
received before the sensitive credential is disclosed. This paper identifies
the privacy vulnerabilities present in on-line trust negotiation and the
approaches that can be taken to eliminate or minimize those
vulnerabilities. The paper proposes modifications to negotiation strategies to
help prevent the inadvertent disclosure of credential information during
on-line trust negotiation for those credentials or credential attributes that
have been designated as sensitive, private information.

1 Introduction 

The dramatic growth of services and information on the Internet is accompanied 
by growing concerns over privacy from individuals, corporations, and govern- 
ment. Privacy is of grave concern to individuals and organizations that operate in 
open systems like the Internet. According to Forrester Research, only six percent 
of North Americans have a high level of confidence in the way Internet sites
manage personal information [3]. These concerns cause some to operate only under
complete anonymity or to avoid going on-line altogether. Although anonymity 
may be appropriate and desirable for casual web browsing, it is not a satisfac- 
tory solution to the privacy issue in conducting sensitive business transactions 
over the Internet because sensitive transactions usually require the disclosure of 
sensitive personal information. 

When a client and server initiate a transaction on the Internet, they often 
begin as strangers because they do not share the same security domain and the 
client lacks a local login. With automated trust establishment, strangers build 
trust by exchanging digital credentials, the on-line analogues of paper credentials 
that people carry in their wallets: digitally signed assertions by a credential issuer 
about the credential owner. A credential is signed using the issuer’s private key
and can be verified using the issuer’s public key. A credential describes one 
or more attributes of the owner, using attribute name/value pairs to describe 
properties of the owner asserted by the issuer. Each credential also contains the 
public key of the credential owner. The owner can use the corresponding private 
key to answer challenges or otherwise demonstrate ownership of the credential. 
Digital credentials can be implemented using, for example, X.509 certificates [Tj. 

In our approach to automated trust establishment, trust is established in- 
crementally by exchanging credentials and requests for credentials, an iterative 
process known as trust negotiation ini0. A trust negotiation strategy controls 
the exact content of the messages exchanged during trust negotiation, i.e., which 
credentials to disclose, when to disclose them, and when to terminate a nego- 
tiation. A naive strategy for establishing trust is for a client to disclose all its 
credentials to an unfamiliar server whenever a client makes a request. This sim- 
ple approach, which completely disregards privacy, is comparable to a customer 
entering a store for the first time, plopping down their wallet or purse on the 
counter, and inviting the merchant to rifle through the contents to determine 
whether or not to trust the customer. Another approach, one that considers 
credential sensitivity, is for each party to continually disclose every credential 
whose access control policy has been satisfied by credentials received from the 
other party. This eager approach results in needless credential disclosures, even 
though the other party is authorized to receive them. In a third approach, each 
party discloses access control policies that focus the negotiation on only those 
credentials actually needed to advance the negotiation. 

Access control policies (policies, for short) for local resources specify
credentials that the other negotiation participant must provide in order to obtain
access to the resource. A party receives credentials from the other participant 
and checks to see if the relevant local policies are satisfied. A party may also 
disclose local policies, which constitute implicit requests for credentials from the 
other party that will advance the negotiation toward the goal of granting access 
to the protected resource. The other participant consults a disclosed policy to
determine whether it can satisfy the policy and responds accordingly.
The purpose of trust negotiation is to find a credential disclosure sequence
$(C_1, \ldots, C_k, R)$, where $R$ is the service or other resource to which access
was originally requested, such that when credential $C_i$ is disclosed, its access
control policy has been satisfied by credentials disclosed by the other party.

As an example of a trust negotiation, suppose Alice is a landscape designer
who wishes to order plants from Champaign Prairie Nursery (CPN). She fills out
an order form on the web, checking an order form box to indicate that she wishes
to be exempt from sales tax. Upon receipt of the order, CPN will want to see
a valid credit card or Alice’s account credential issued by CPN, and a current
reseller’s license. Alice has no account with CPN, but she does have a digital
credit card. She is willing to show her reseller’s license to anyone, but she will
show her credit card only to members of the Better Business Bureau. A possible
credential exchange sequence for the above example is shown in Figure 1.
Parties: Landscape Designer, CPN

CPN’s Policies:

No_Sales_Tax_OK ← (Credit_Card ∨ CPN_Account) ∧ Reseller_License
BBB_Member ← true

Fig. 1. An example of access control policies and a safe disclosure sequence that
establishes trust between a server and a client.
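
The CPN example can be replayed with a small sketch. It is not the negotiation strategy proposed in this paper: for brevity it uses the simple eager exchange, in which each side discloses every credential or resource whose policy is already satisfied. Policies are modeled as predicates over the set of names received from the other party; the function name and the "client"/"server" labels are hypothetical, and Alice's two policies are taken from the prose above.

```python
def negotiate(client_policies, server_policies, target):
    """Eager exchange; returns the disclosure sequence ending in `target`,
    or None if the negotiation gets stuck."""
    from_client, from_server, sequence = set(), set(), []
    progress = True
    while progress:
        progress = False
        for name, policy in server_policies.items():
            if name not in from_server and policy(from_client):
                from_server.add(name)
                sequence.append(("server", name))
                progress = True
                if name == target:
                    return sequence
        for name, policy in client_policies.items():
            if name not in from_client and policy(from_server):
                from_client.add(name)
                sequence.append(("client", name))
                progress = True
    return None


# Alice has no CPN account, so only two credentials appear on her side.
ALICE = {
    "Reseller_License": lambda received: True,
    "Credit_Card": lambda received: "BBB_Member" in received,
}
CPN = {
    "BBB_Member": lambda received: True,
    "No_Sales_Tax_OK": lambda received: (
        ("Credit_Card" in received or "CPN_Account" in received)
        and "Reseller_License" in received
    ),
}
```

Calling `negotiate(ALICE, CPN, "No_Sales_Tax_OK")` yields a sequence in which CPN's BBB membership is disclosed before Alice's credit card, so every disclosure happens only after its policy has been satisfied.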



When a policy P contains sensitive information, then P itself requires pro- 
tection in the form of a policy for access to P. For example, a client interacting 
with an unfamiliar web server may request to see credentials that attest to the 
server’s handling of private information as well as certified security practices 
during the negotiation prior to disclosing further requests that the client consid- 
ers sensitive, such as a policy that specifies the combination of credentials that 
the client requires for access to a business process which the client considers a 
trade secret. This situation requires that trust be established gradually, so that a 
policy referencing a sensitive credential or containing a sensitive constraint is not 
disclosed to a total stranger. Seamons et al. mu introduced the notion of policy 
graphs, to represent the layers of access control policies used to guard a resource 
during gradual trust establishment. A policy graph for a protected resource R is 
a finite directed acyclic graph with a single source node and a single sink R. All 
the nodes except R represent policies that specify the properties a negotiation 
participant may be required to demonstrate in order to gain access to R. Each 
sensitive credential and service will have its own separate policy graph. Each 
policy represented as a node in a policy graph G implicitly also has its own 
graph, which is the maximum subgraph of G for which that policy node is the 
sole sink. Intuitively, the resource represented by node P can be disclosed only 
if there exists a path from the source to P such that the disclosed credentials 
satisfy every policy along the path except possibly the policy at node P. Policy 
graphs allow a policy writer to gradually enforce sensitive constraints on creden- 
tials, without showing all the constraints to the other party at the beginning of 
the negotiation. Figure 2 shows an example of policy graphs and a policy path
from source node $P_0$ to a policy node $P_n$.
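
The disclosure rule for policy graphs can be sketched as a reachability check. The graph encoding (adjacency lists), the predicate representation of policies, and the function name are assumptions made for illustration; the node and credential names in the usage note are hypothetical.

```python
def can_disclose(graph, policies, source, node, creds):
    """True if some path from `source` to `node` exists on which every policy
    before `node` is satisfied by `creds`. The policy at `node` itself is not
    checked, matching "except possibly the policy at node P"."""

    def reachable(current, seen):
        if current == node:
            return True
        if current in seen or not policies[current](creds):
            return False
        seen.add(current)
        return any(reachable(nxt, seen) for nxt in graph.get(current, ()))

    return reachable(source, set())
```

For a chain P0 → P1 → R, the source policy P0 can always be shown, P1 is shown only once P0 is satisfied, and R is granted only after both are satisfied, so a stranger never sees the constraints in P1.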

Finally, there will be resources so confidential that the resource owner will 
not disclose the policy governing the resource under any circumstances. In this 
situation, only people who know the policy in advance and who satisfy the policy 
proactively without being provided a clue can have access to the resource. 

Policy disclosures have the potential for serious breaches in privacy that 
violate the basic safety guarantees that trust negotiation should provide. In this 
Fig. 2. An example policy graph. A solid line means the two nodes are adjacent to each
other while a dashed line means there may be several nodes between the two policy 
nodes. 

paper, we identify the ways that trust negotiation strategies fail to adequately 
protect credential privacy. We then suggest practical remedies that can help 
to safeguard privacy during trust negotiation and argue that these remedies 
preserve trust negotiation safety guarantees. This paper also describes the role 
of trust negotiation in automatically determining the privacy handling practices 
of servers, so that clients can automatically enforce their privacy preferences 
and rely on third parties for greater assurance of a server’s privacy handling 
practices. 

2 Privacy Vulnerabilities during Trust Negotiation 

This section presents four ways that privacy can be compromised during on- 
line trust negotiation. The first two compromises can occur with existing trust 
negotiation strategies that rely on policy disclosures to guide the negotiation. 
The third is related to attacks on the policy specification. Finally, the fourth 
relates to issues regarding the privacy practices of web servers. 

2.1 Possession or Non-possession of a Sensitive Credential 

The trust relationships that individuals maintain are often considered sensitive. 
For instance, employers, health care providers, insurance companies, business 
suppliers, and financial institutions may maintain potentially sensitive trust re- 
lationships with individuals. These trusted institutions are often the issuers of 
digital credentials. The type of the credential can be a reflection of a trust re- 
lationship, such as an IBM employee credential or a GM preferred supplier cre- 
dential. The existence of credentials that reflect trusted relationships could lead 
to the situation in which the possession or non-possession of a certain type of 
credential is considered sensitive information. Simply revealing the type of the 
credential may release more information about a trusted relationship than the 
holder of the credential is comfortable disclosing. We call this kind of credential 
possession-sensitive. 

Not all sensitive credentials will be possession-sensitive. For example, a
generic employee credential may not be possession-sensitive since the specific
employer information is included as an attribute value in the credential, not as
part of the
credential type. In this case, the credential type conveys less information about 
the contents of the credential. A negotiation participant might have no qualms 
about revealing the fact that they possess an employee credential to a stranger. 
However, the attribute values in the credential that identify the employer might 
be considered sensitive information. 

Most requests for a credential will specify the desired type of the credential. 
As mentioned earlier, when a request for a sensitive credential C is received, a 
negotiation participant who possesses the credential typically responds with a 
counter request for the credentials necessary to establish enough trust for C to 
be disclosed. This approach has potential privacy protection loopholes because 
the issuing of a counter request can be interpreted as a strong indication of pos- 
session of the requested credential, and the absence of a counter request can be 
interpreted as a strong indication of non-possession of the credential. Thus, the 
behavior of a negotiation participant may present clues regarding the posses- 
sion or non-possession of a sensitive credential, even when the credential itself is 
never actually disclosed. In order to guard against the release of sensitive infor- 
mation when a possession-sensitive credential is requested during negotiation, 
its owner’s behavior must not allow the other party to infer whether or not the 
owner possesses that credential.

2.2 Sensitive Credential Attributes 

Besides credential type, other attributes in a credential can also contain sensitive 
information. For example, the issuer of the credential can convey a strong indica- 
tion of the nature of the credential and the trust relationship that it represents. 
Other examples of sensitive credential attributes include age, Social Security
number, salary, credit rating, etc. We refer to a credential with a sensitive
attribute as an attribute-sensitive credential.

Trust negotiation may also violate privacy when Alice’s credential contains a 
sensitive attribute value and Bob’s credential request (disclosed policy) specifies 
a constraint on that sensitive attribute value. If Alice issues a counter request in
response to Bob’s credential request, Bob can infer that Alice owns a credential
that satisfies the constraint.

Policy disclosures can be used as probes to “determine” the exact value of
a sensitive attribute without the credential ever being disclosed. For example,
suppose that Alice considers her age to be sensitive, and the policy x.type =
drivers_license ∧ x.age > 25 is disclosed to her (see the footnote). Normally
Alice would respond
by disclosing the policy for her driver’s license. If Bob is an adversary who is

Footnote: In practice, the constraint on the credential’s age attribute should be expressed as a
comparison between the credential’s date_of_birth attribute and a specific date. Also, 
we need to specify that the issuer of the credential is from a qualified government 
bureau, and that the other party should authenticate to the owner of the credential. 
We omit such detailed constraints in order to focus on the issue of attribute-sensitive 
and possession-sensitive credentials. We will make the same simplification in the rest 
of this paper. 
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not qualified to gain access to Alice’s driver’s license, then Bob could engage 
Alice in a series of trust negotiations involving policy disclosures of the form 
x.type = drivers Jicense A x.age > 24, x.type = drivers -license A x.age > 25, 
x.type = drivers Jicense A x.age > 26, etc. Once Alice fails to respond with a 
policy. Bob can infer that the credential no longer satisfies the constraint. Thus, 
Bob can determine Alice’s age without seeing her driver’s license. 
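
The probing attack described above can be illustrated with a minimal sketch. It is not taken from any trust negotiation implementation: the names License, alice_responds and probe_age are hypothetical, and Alice's agent is assumed to answer a request with a policy disclosure exactly when her credential satisfies the disclosed constraint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class License:
    """Hypothetical credential held by Alice; it is never sent to Bob."""
    type: str
    age: int


def alice_responds(credential: License, threshold: int) -> bool:
    """Naive agent: respond with a policy disclosure only if the credential
    satisfies x.type = drivers_license AND x.age > threshold."""
    return credential.type == "drivers_license" and credential.age > threshold


def probe_age(credential: License, start: int = 0, limit: int = 130) -> Optional[int]:
    """Bob's probe: raise the threshold until Alice stops responding."""
    for threshold in range(start, limit):
        if not alice_responds(credential, threshold):
            if threshold == start:
                return None  # Alice never responded, so nothing specific is learned
            # Alice responded for threshold - 1 but not for threshold,
            # so threshold - 1 < age <= threshold, i.e. age == threshold.
            return threshold
    return None


alice = License(type="drivers_license", age=37)
inferred = probe_age(alice)
```

In this sketch Bob only ever observes whether a response arrives, yet the loop recovers the age attribute exactly, which is the leak the text describes.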

A credential may contain multiple sensitive attributes. In some situations, 
a subset of the sensitive attributes in Alice’s credential may be of interest to 
Bob. One interesting approach to protecting privacy is to selectively disclose 
attributes within a credential so that only the needed subset is made available 
to the recipient of the credential [?]. 

2.3 Extraneous Information Gathering 

During trust negotiation, one of the parties may request credentials that are not 
absolutely necessary or relevant to the trust requirements of the transaction, 
even though that party may be entitled to see them. A policy could require the 
disclosure of an inordinate amount of information during a trust negotiation, 
beyond what is truly needed to protect its resource. An unscrupulous party could 
use seemingly legitimate credential requests to gather extraneous information, 
thus violating the privacy of the other party. 

Policies should be disclosed over a secure channel. Otherwise, an attacker 
can modify a policy during transmission to increase the number of required cre- 
dentials, and force a participant to disclose more information than the requester 
intends. The attacker could eavesdrop on the subsequent exchange and gather 
the additional sensitive data, even if the bona fide negotiation participant pays 
no heed to the unexpected additional data. Even when a negotiation takes place 
over a secure channel, a similar threat exists for policy data stored on disk. If 
an attacker gains temporary access to the host that manages the policy data, 
an attacker could modify the policy on disk before it is disclosed over the secure 
channel, potentially resulting in improper, excessive credential disclosures or in 
loss of the intended protection for local resources. In this scenario, the attacker 
would not gain access to the sensitive information in any subsequent secure ne- 
gotiations, but the attacker effectively disrupts security by causing improper 
disclosures that violate the original policy specifications. 

In order to prevent malicious or accidental modification of policies, a policy 
can be digitally signed to protect its integrity. This signature will enable Bob to 
detect any modification of Alice’s policy that took place during transmission over 
an insecure channel or on disk prior to transmission, as long as Alice’s private 
key, used to sign the policy, has not been compromised. 
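
As a rough illustration of how a signed policy lets the recipient detect tampering, the sketch below uses textbook RSA over a SHA-256 digest. This is an assumption made for the example only: the paper does not name a signature scheme, the key is built from two small, publicly known Mersenne primes, and no padding is applied, so the code is unsuitable for real use. The function names are hypothetical.

```python
import hashlib

# Toy key material, for illustration only (two known Mersenne primes).
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q                      # public modulus
E = 65537                      # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # Alice's private exponent (Python 3.8+)


def digest(policy: str) -> int:
    """Hash the policy text and reduce it into the RSA modulus range."""
    raw = hashlib.sha256(policy.encode("utf-8")).digest()
    return int.from_bytes(raw, "big") % N


def sign_policy(policy: str) -> int:
    """Alice signs the policy digest with her private exponent."""
    return pow(digest(policy), D, N)


def verify_policy(policy: str, signature: int) -> bool:
    """Bob checks the signature using only Alice's public values (N, E)."""
    return pow(signature, E, N) == digest(policy)


original = "x.type = drivers_license AND x.age > 25"
tampered = original + " AND y.type = credit_card"
signature = sign_policy(original)
```

Here verify_policy(original, signature) succeeds, while the same signature checked against the tampered policy fails, whether the modification happened in transit or on disk before transmission.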

2.4 Privacy Practices 

Trust negotiation does not control or safeguard information once it has been 
disclosed. Thus sensitive information obtained during trust negotiation may be 
intentionally or unwittingly distributed to an unauthorized third party later on. 
Internet users want control over what information they disclose to web sites, 
and they want assurances about what those sites will do with the informa- 
tion once it is disclosed to them. However, control over personal information 
on-line is handled in an ad hoc fashion today. In the United States, businesses 
address privacy protection through self-regulation. Companies such as TRUSTe 
(http://www.truste.org) and BBBOnline (http://www.bbbonline.org) of- 
fer privacy seals to a web site that publishes its policy for privacy handling 
practices and adheres to it. The TRUSTe privacy seal is an indication that the 
web site has a publicly available privacy policy to which it adheres. Users can 
visually recognize the seal and can read the associated privacy policy. Perhaps 
many users never bother to read and understand the policy. The mere presence 
of the seal and privacy policy link may alleviate privacy concerns to the point 
that users are willing to interact with the server. However, the seal itself is sus- 
ceptible to forgery, and the seal’s presence says nothing concerning the details of 
the web site’s privacy policy. Interested users must wade through the fine print 
of legal-sounding documents on the web site to glean the details of the privacy 
handling policy, an arduous task that most users lack the stamina to complete. 
Further, many privacy policies include the provision that they are subject to 
change without notice, rendering their guarantees worthless. For more sensitive 
interactions, clients will need to verify automatically that Internet sites agree to 
follow certain procedures that meet the client’s needs, or that Internet sites are 
certified to adhere to standardized, well-known privacy practices and procedures. 



3 Privacy Safeguards for Trust Negotiation 

The previous section introduced privacy loopholes in trust negotiations that dis- 
close access control policies. These shortcomings can be remedied by introducing 
stronger privacy controls during trust negotiation to prevent the inadvertent dis- 
closure of sensitive credential information during a negotiation, as discussed in 
this section. Under one approach, policies governing credential disclosure are ex- 
tended to support the designation of whether possession or non-possession of the 
credential type or any of the credential attribute values is considered sensitive 
information. When requests for credentials are received, the security agent must 
determine the nature of the requests and determine if any credential privacy 
protections are violated before responding to the request. Another approach is 
to dynamically modify policies to prevent leakage of sensitive information. 

3.1 No Response 

Trust negotiation strategies traditionally ensure that all disclosures are safe [?]. 
The disclosure of a credential or policy is safe if the credentials disclosed by the 
other party satisfy the access control policy that governs access to the creden- 
tial or policy. When a possession-sensitive credential is requested during trust 
negotiation, responding to that request with a policy disclosure can amount to 
an admission of possession without actually disclosing the credential. Failure to 
respond with a counter request can amount to an admission of non-possession. 
In order to protect privacy, the definition of a safe disclosure must be extended. 

When Alice receives Bob’s policy Q, which refers to a possession-sensitive 
credential C, then if Alice discloses the policy P for C, Bob may infer that she 
possesses C. If disclosing C at this point in the negotiation is not safe, then one 
possibility is to conclude that it is unsafe to disclose P. Then Bob will be unable 
to determine correctly whether Alice possesses C. 

To add this feature to current trust negotiation systems, we can extend the 
model for policy graphs presented in [?] to support possession-sensitive credentials. 
Previously, the policy graph model assumed that the source policy node 
of a policy graph can be disclosed, and only internal policy nodes in the graph 
are sensitive. In the case of a possession-sensitive credential, an indicator can 
be associated with the source node of a policy graph to indicate whether or not 
the policy can be disclosed. This can be viewed as a generalization of the 
requisite/prerequisite rules of [?] to apply to policy graphs. For possession-sensitive 
credentials, no policies in the graph will be disclosed. Under this approach, Al- 
ice will only disclose C if Bob pushes the necessary credentials to her without 
being prompted. The downside of this approach is that negotiations can fail to 
establish trust when it is theoretically possible to do so. 
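
The no-response rule can be sketched as a small decision function. This is an illustrative reading of the rule, not the authors' implementation: CredentialEntry, respond and the string return values are hypothetical names, and a policy is modeled simply as a predicate over the set of credential names the other party has already disclosed.

```python
from dataclasses import dataclass
from typing import Callable, Set


@dataclass
class CredentialEntry:
    name: str
    possessed: bool
    possession_sensitive: bool                # indicator attached to the source node
    policy: Callable[[Set[str]], bool]        # access policy over received credentials


def respond(entry: CredentialEntry, received: Set[str]) -> str:
    """Decide how Alice reacts to a request for one credential."""
    if entry.possessed and entry.policy(received):
        return "disclose_credential"          # safe disclosure in the usual sense
    if entry.possession_sensitive:
        return "no_response"                  # same answer whether or not she has it
    if entry.possessed:
        return "disclose_policy"              # ordinary counter request
    return "no_response"


def needs_employee_id(creds: Set[str]) -> bool:
    return "employee_id" in creds


owned = CredentialEntry("C", True, True, needs_employee_id)
missing = CredentialEntry("C", False, True, needs_employee_id)
ordinary = CredentialEntry("D", True, False, needs_employee_id)
```

With an empty set of received credentials, respond returns the same value for owned and missing, so Bob cannot distinguish possession from non-possession; only a Bob who pushes the required credential unprompted receives C.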

Extending policy graphs to support possession-sensitive credentials indirectly 
benefits Alice when she wants to disguise the fact that she does not possess certain 
credentials. Based on Alice’s behavior, Bob is unable to distinguish whether 
Alice’s failure to respond with a counter request during trust negotiation is a 
sign that Alice does not possess the credential or a sign that she possesses the 
credential and is unwilling to acknowledge that fact because she considers the 
credential possession-sensitive. 

3.2 Pretend to Possess a Credential 

If Bob asks for credential C and Alice doesn’t have C, some previously pro- 
posed trust negotiation strategies may require Alice to tell Bob that she does 
not possess C. This is a problem if Alice is sensitive about the fact that she does 
not have C. Suppose that C is sensitive and that those who possess C are not 
opposed to having others believe that they possess it. Then Alice can 
create a policy governing access to C, even if she does not possess C. Whenever 
Alice receives a request for C, she discloses the same policy for C as those who 
possess C. Once strangers demonstrate that they can be trusted with informa- 
tion regarding the credential C, Alice can send an admission that she does not 
possess C. 

To support the ability to mislead strangers about whether one possesses a 
sensitive credential, we can extend the policy graph model to include an internal 
policy node in a policy graph with the policy false. For example, suppose that 
policy P governs access to C. Those that do not possess C can use a policy graph 
for C that is obtained as follows. Let G be the policy graph for C when a party 
possesses C. We can replace the sink node of G by false, add a new sink node 
representing C, add an edge from false to the new sink node, and let the resulting 
graph be G'. Then the graph for C when a party does not possess C should be 
G'. Whenever C is requested, the source node of G' is disclosed. Under this 
approach, Bob will have to pass the same set of trustworthiness tests, whether 
or not Alice possesses C, before he can determine whether Alice possesses C. 
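
The construction of G' can be sketched as follows. The representation is an assumption made for illustration: a policy graph is a dictionary mapping each node label to the labels of its children, and pretend_graph is a hypothetical helper name.

```python
from typing import Dict, List

Graph = Dict[str, List[str]]  # node label -> labels of child nodes


def pretend_graph(g: Graph, sink: str) -> Graph:
    """Build G' from G: the old sink is replaced by a `false` policy node,
    and a new sink for the credential is attached beneath it."""
    g_prime: Graph = {}
    for node, children in g.items():
        if node == sink:
            continue  # the old sink is replaced below
        # Edges that pointed at the sink now point at the `false` node.
        g_prime[node] = ["false" if child == sink else child for child in children]
    g_prime["false"] = [sink]  # edge from false to the new sink node
    g_prime[sink] = []
    return g_prime


# G: the policy graph used by a party that actually possesses C.
G = {"P_source": ["P_inner"], "P_inner": ["C"], "C": []}
G_prime = pretend_graph(G, "C")
```

Every node above the sink is identical in G and G_prime, so the policies disclosed during negotiation are the same in both cases; since the policy false can never be satisfied, a party using G_prime never reaches the point of disclosing C.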

This approach safeguards non-possession of a sensitive credential from those 
who are not authorized to know that information. Once this approach is widely 
adopted, it effectively safeguards possession of a sensitive credential, since an 
attacker cannot be certain whether or not the other party in the negotiation 
possesses the credential, or is simply acting as though they possess it. This 
approach may be practical only for safeguarding possession-sensitive credentials 
when there are standard, reusable, widely adopted policies for certain types of 
credentials that come as the standard equipment in trust negotiation packages. 

3.3 Dynamic Policy Graphs 

As described in the section on sensitive credential attributes, attribute-sensitive 
credentials are vulnerable to a probing attack that can expose the exact value of a 
sensitive attribute. One way to overcome this vulnerability with trust negotiation 
is by dynamically modifying policies before they are evaluated. 

Because of the hierarchical nature of policy graphs and sharing of variables 
between policies, there may be several ways to specify the same set of constraints 
on credentials in a resource’s policy graph. Some of these specifications may cause 
the other party to leak information about the credentials it possesses, even if 
the credentials are never disclosed. For example, suppose a customer places an 
order for a rental car at an on-line car rental service. The policy for such a 
service R states that the customer must show his/her driver’s license to prove 
that his/her age is 25 or over. Figure 3 shows two ways to express this policy. 
The first version has a single policy node in R’s policy graph, which states 
x.type = drivers_license ∧ x.age > 25. When the customer’s security agent 
receives such a policy, it checks its driver’s license credential. If the age attribute 
is over 25, it sends back the access control policy for the customer’s driver’s 
license as the next message. Otherwise, the negotiation fails. The problem is 
that if the on-line store receives the policy of the customer’s driver’s license 
credential, it can guess that the customer is over 25, even though the driver’s 
license has not been disclosed. When the customer is sensitive about the age 
attribute, such information leakage is undesirable. 

In the second policy graph in Figure 3, the source of R’s policy graph contains 
only the constraint on the credential’s type attribute, namely, x.type = 
drivers_license. The source has only one child P2, which is x.age > 25, and 
P2’s child is the protected resource R. By adopting such a policy graph, when 
the client requests access to R, the source node of R’s policy graph is first sent 
back, which contains no constraints on sensitive attributes. Therefore, when the 
client returns its driver’s license’s policy in the next message, the server can- 
not infer any information about that credential’s age attribute. The negotiation 
continues, and only after the client actually discloses its driver’s license can the 
server know the credential’s age attribute and check whether the policy in the 
next level is satisfied. Thus no information about the driver’s license’s sensitive 
attributes is leaked. 

P = x.type = drivers_license ∧ x.age > 25 
P1 = x.type = drivers_license 
P2 = x.age > 25 

[Figure: first graph, P → Car Rental Reservation; second graph, P1 → P2 → Car Rental Reservation] 

Fig. 3. Two different policy graphs that express the requirements for an on-line car 
rental reservation service. 

One straightforward approach to preventing leakage is for Alice to act conservatively, 
as defined in Section 3.1 on no response; i.e., if Alice receives a policy 
P that involves constraints on sensitive attributes of a credential, and it is the 
first time that that credential variable appears in the path from the source to 
P, Alice simply assumes that the policy will never be satisfied. Thus no policy 
for that credential will be sent back. The obvious drawback is that a possible 
successful negotiation may fail because a party does not cooperate enough. As 
another approach, we may require that when a policy P is disclosed which con- 
tains a credential variable that does not occur in any predecessor of P along a 
path from the source to P, then P cannot contain constraints on sensitive at- 
tributes. If all the policy graphs satisfy this condition, as we show in the second 
policy graph in Figure 3, attribute-sensitive credentials are protected. However, 
in practice, imposing such a requirement severely limits a policy writer’s flexibil- 
ity. For example, negotiation failure may seem mysterious because policy nodes 
that contain no new credential variables may never be disclosed. What’s more, 
in most cases, a policy writer will not know which attributes of the other party’s 
credentials are sensitive, making policy graph design extremely hard. 

In fact, what makes the above two approaches impractical is the gap be- 
tween the expectations of Bob, who designed the local policy graphs, and Alice, 
who tries to satisfy those policies. On the one hand, Bob wants a policy to contain 
complete constraints on credentials so that the whole policy graph is simple and 
manageable. On the other hand, Alice wants to separate constraints on sensitive 
credentials from constraints on insensitive ones, which better protects the 
information in her credentials. To fill this gap, we can dynamically modify the 
policy before Alice’s strategy engine starts to try to satisfy it. 

Dynamic policy modifications are performed by a policy transformation agent 
(PTA) inside each party’s security agent (see Figure 4). When Alice sends a 
resource’s access control policy P to Bob, before the policy goes to Bob’s strategy 
engine, Bob’s PTA first checks whether P involves any sensitive attributes of 
Bob’s credentials. If so, the PTA transforms P into a hierarchical policy. That 
is, the PTA breaks P into a set of subpolicies whose conjunction entails P, and 
arranges the subpolicies into a sequence such that the first subpolicy in the 
sequence does not involve any sensitive attributes. After such a sequence is formed, 
only the first policy in the sequence is passed to Bob’s strategy engine, which 
checks Bob’s credentials and sends back policies for the relevant credentials. 
Because of the policy transformation done by the PTA, the strategy engine does 
not see the rest of the original policy that involves constraints on sensitive 
attributes. Therefore, its response does not imply that Bob’s credentials can satisfy 
those sensitive constraints. 

Fig. 4. Upon receiving a policy P, the security agent first lets the PTA check whether 
P involves sensitive attributes of credentials. If so, the PTA will transform P to P', and 
pass P' to the strategy engine. Based on P' instead of P, the strategy engine suggests 
the response, which is sent back to the other negotiation party. Note that this figure 
shows only how the PTA works. We omit other necessary modules of a security agent, 
such as the credential verification module and policy compliance-checking module. 

Let’s look at our previous example. Suppose Alice sends R’s policy x.type = 
drivers_license ∧ x.age > 25 to Bob. When Bob receives it, his PTA checks the 
driver’s license credential and finds that the age attribute is sensitive. Then it 
decomposes the original policy into two subpolicies: P1 = x.type = drivers_license 
and P2 = x.age > 25. P1 and P2 form a sequence, and P1 is passed to the local 
strategy engine. In this case, even if the age given on Bob’s driver’s license is 
less than 25, the strategy engine will still send its policy in the next message. If 
Alice knows that Bob has a PTA, then Alice cannot infer anything about Bob’s 
age from Bob’s message. 
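
The decomposition performed by the PTA in this example can be sketched in a few lines. The data model is an assumption made for illustration: a policy is treated as a flat conjunction of attribute constraints, and Constraint and transform are hypothetical names rather than part of any existing system.

```python
from typing import List, NamedTuple, Set


class Constraint(NamedTuple):
    attribute: str
    op: str
    value: object


Policy = List[Constraint]  # interpreted as a conjunction of constraints


def transform(policy: Policy, sensitive: Set[str]) -> List[Policy]:
    """Split a policy into a sequence of subpolicies whose first element
    contains no constraint on a sensitive attribute."""
    insensitive = [c for c in policy if c.attribute not in sensitive]
    guarded = [c for c in policy if c.attribute in sensitive]
    if not guarded:
        return [policy]            # nothing sensitive: pass P through unchanged
    return [insensitive, guarded]  # P1 first, then P2


P = [Constraint("type", "=", "drivers_license"), Constraint("age", ">", 25)]
sequence = transform(P, sensitive={"age"})
P1 = sequence[0]  # the only part handed to the strategy engine
```

Only P1 reaches the strategy engine, so its response depends on the type constraint alone. If every constraint in P is sensitive, this sketch yields an empty first subpolicy, which is trivially satisfied and still reveals nothing about the sensitive attributes.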

3.4 Privacy Practices 

Trust negotiation presents the opportunity to develop a more systematic ap- 
proach to handling privacy practices on the web. Using trust negotiation, certi- 
fied privacy practices can be represented in the form of digital credentials that 
can be disclosed in response to user policies that require certain privacy practice 
guarantees. In contrast to the web site privacy logos used today, the security 
agent on the user side can make sure that privacy practice credentials are not 
forged, and verify the ownership of such credentials using the public key encryption 
algorithm inherent in digital credentials. This can be done automatically 
without involving the user. 

Representing privacy practices in credentials rather than in plaintext web 
documents satisfies several purposes. First, standardized privacy practice cre- 
dential types will more easily support automated verification of privacy policies 
in software. Second, trust in the issuer that signs the credentials is a way to cre- 
ate stronger trust in the privacy practices of the organization. A third party that 
audits the sites’ practices can issue credentials that describe privacy practices. 

When dealing with privacy issues, web sites typically offer some form of an 
opt-in or opt-out process. The client’s security agent can record what has been 
opted in and opted out and request and receive a signed commitment from 
the server specifying the agreement regarding the collection and handling of the 
client’s private information. We call the signed commitment from the web site an 
electronic receipt that functions as a proof in case of dispute [?]. The electronic 
receipt can also be signed by the client to protect the website from the client’s 
repudiation. 

Such a commitment can be referenced later should the client seek redress of 
some kind after it is detected that the server has broken its earlier commitment. 
This approach does not technically prevent the disclosure, but only provides the 
client with evidence to present during the recourse process. Technical solutions 
will require that associated legal and social processes beyond the scope of this 
paper be effective in mitigating privacy risks that currently prevent certain on- 
line transactions. 

4 Related Work 

Recent work on trust negotiation strategies (e.g., [?]) includes proposals for 
policy disclosures. No previous work considers all of the trust negotiation privacy 
vulnerabilities that we discuss. By adopting our proposals, existing strategies will 
be able to provide greater privacy protection guarantees. 

Bonatti et al. [5] introduce a uniform framework and model to regulate service 
access and information release over the Internet. Their framework is composed of 
a language with formal semantics and a policy filtering mechanism that provides 
service renaming in order to protect privacy. Their work adopts a model very 
similar to the trust negotiation model presented in this paper. 

X-Sec [?] is an XML-based language for specifying credentials and access 
control policies. X-Sec was designed to protect Web documents. Because creden- 
tials and policies can themselves be expressed as XML documents, they can be 
protected using the same mechanisms developed for the protection of Web docu- 
ments. If X-Sec were used in trust negotiation, the privacy protection approaches 
presented in this paper would help safeguard the contents of X-Sec credentials 
and policies. 

Winsborough et al. include a design for using the RT0 language in trust 
negotiation. They describe some of the privacy vulnerabilities in existing trust 
negotiation protocols and introduce support for attribute acknowledgment policies. 
In their approach, users establish policies for attributes that they consider 
sensitive, whether or not they actually possess them. The RT0 language supports 
credentials containing a single attribute. Thus, their use of attribute 
acknowledgment policies is akin to our approach of pretending to possess a credential. 
Our approach to protecting sensitive attributes using policy transformation is 
one potential solution that could be adopted once RT is extended to support 
credentials containing multiple attributes. Another distinction between our work 
and the support provided in RT0 is that we argue that the capability of pretending 
to possess a credential is not sufficient to adequately safeguard possession 
of a sensitive credential if only those that actually possess the credential are 
sensitive about that fact. Thus, we differ from the RT work in that we also 
recommend support for ignoring a request for a possession-sensitive credential 
whenever there is insufficient trust to immediately disclose the credential. 

Approaches to safeguarding the privacy of sensitive attributes within credentials 
have been presented by Persiano et al. [?] and Brands [?]. Persiano et al. 
introduced the concept of a crypto certificate and Brands introduced Private 
Credentials. Both approaches aim to support the selective disclosure of sensitive 
attributes within a credential. Incorporating the selective disclosure of creden- 
tial attributes within trust negotiation will extend the granularity of credential 
disclosure access control to the attribute level, giving users more flexibility in 
expressing their privacy requirements. 

Biskup describes two methods of preserving secrecy in a database system [2], 
namely refusal and lying. When an unauthorized user requests access to sensitive 
information, the refusal approach is to return a special value like mum, while 
the lying approach is to return the negation of the real answer. Biskup’s work 
assumes that the user knows that the requested information exists on the system 
yet does not know if that information is considered secret and is protected by 
the system. Secrecies known is the situation in which a user knows information 
on the system that is considered a secret; otherwise, it is secrecies unknown. 
Refusal works for both the secrecies known and secrecies unknown cases, while 
lying works only under secrecies unknown. The assumption of knowledge of the 
existence of secret data on the system is similar to the situation with attribute- 
sensitive credentials. The dynamic policies approach is a type of refusal, at least 
for the constraint portion on the sensitive attribute. The failure to respond with 
a counter request for a possession-sensitive credential is also a form of refusal. 
Pretending to own a possession-sensitive credential is akin to the lying approach, 
at least temporarily. 

The P3P standard [?] defined by W3C focuses on negotiating the disclosure of 
a user’s sensitive private information based on the privacy practices of the server. 
Trust negotiation generalizes P3P to base disclosure on any server property of 
interest to the client that can be represented in a credential. The work on trust 
negotiation focuses on certified properties of the credential holder, while P3P is 
based on data submitted by the client that are claims the client makes about 
itself. Support for both kinds of information in trust negotiation is warranted. 
Bonatti et al. also provide for both types of data in client and server portfolios 
in their system [5]. 

5 Conclusion and Future Work 

This paper identifies the privacy vulnerabilities present in on-line trust negotia- 
tion, especially when access control policies are being disclosed. Although policy 
disclosure helps to focus the negotiation on those credentials that can lead to a 
successful negotiation, the approach is subject to serious lapses in privacy pro- 
tection and can inadvertently disclose evidence of possession or non-possession of 
a sensitive credential or inadvertently disclose the value of a sensitive credential 
attribute, without the credential ever being disclosed. Another privacy risk dur- 
ing trust negotiation is excessive gathering of information that is not germane 
to the transaction, due to unscrupulous negotiation participants or attacks on 
policy integrity. And finally, there is cause for concern regarding the handling of 
private information once it has been disclosed. 

This paper identifies two kinds of sensitive credentials, a possession-sensitive 
credential and an attribute-sensitive credential, and identifies several potential 
means of preventing information leakage for these kinds of credentials. The paper 
proposes modifications to negotiation strategies to help prevent the inadvertent 
disclosure of credential information during on-line trust negotiation for those 
credentials or credential attributes that have been designated as sensitive, private 
information. The paper also describes how trust negotiation can be used to 
enforce the privacy preferences of clients by verifying a server’s privacy handling 
policies and certifications automatically, potentially strengthening the current 
ad hoc process that services like TRUSTe and BBBOnline employ for managing 
privacy practices on the Internet. 

After a formal analysis of the privacy guarantees provided by these ap- 
proaches, we will implement one or more of the approaches in TrustBuilder, our 
prototype environment for developing reusable trust negotiation components 
that run in a variety of contexts [?]. We have been experimenting with trust 
negotiation strategies that exchange policies in TrustBuilder. Since privacy vul- 
nerabilities exist when policy disclosures take place, this will provide us with an 
excellent environment in which to experiment with our proposed solutions to the 
privacy problem. 

Abstract. This paper reports our experimental work in using commercial 
secure coprocessors to control access to private data. In our initial 
project, we look at archived network traffic. We seek to protect the pri- 
vacy rights of a large population of data producers by restricting compu- 
tation on a central authority’s machine. The coprocessor approach pro- 
vides more flexibility and assurance in specifying and enforcing access 
policy than purely cryptographic schemes. This work extends to other 
application domains, such as distributing and sharing academic research 
data. 



1 Introduction 

This paper presents a snapshot of an ongoing experimental project to use high- 
end secure coprocessors to arbitrate access to private data. 

Our work was initially inspired by Michigan’s Packet Vault project [2], which 
examined the engineering question of how a central authority, such as law en- 
forcement or university administration, might archive traffic on a local area 
network, for later forensic use. 

In addition to the engineering question, we perceived a Digital Rights Man- 
agement (DRM) aspect of an unusual kind. In the standard DRM scenario, a 
large data producer seeks to control use of its data by multiple “small” users. We 
see the possibility to turn this around and enable individual users to control with 
confidence how their data is to be used by a powerful authority (“Big Brother”). 
Why would people agree to have data collected at all? For network traffic in 
particular, some benefits of collection are certainly plausible, such as: 

— the ability of administrators to diagnose whether a particular network attack 
occurred; 

— the ability of law enforcement to gather evidence for illegal activity that the 
community regards as sufficiently egregious; 

— the ability of scientists to analyze suitably sanitized Web surfing activity of 
consenting individuals. 

How do we restrict the usage rights on archived data to exactly socially pre- 
scribed policy — especially given human nature’s inclination to exceed authority? 

R. Dingledine and P. Syverson (Eds.): PET 2002, LNCS 2482, pp. 144 ff., 2003. 

© Springer-Verlag Berlin Heidelberg 2003 
We designed and prototyped a system using commercially available high-assurance 
secure coprocessors. We felt that a working prototype would further 
several goals: 

— validating that this approach to user privacy works in practice; 

— demonstrating the advantages of a computational approach to rights enforce- 
ment, compared to strictly a cryptographic one; 

— demonstrating the application potential of current commercial off-the-shelf 
(COTS) secure coprocessor technology; 

— demonstrating a socially equitable use of tamper-respondent technology: 
rather than impinging on citizens’ computation, it restricts Big Brother 
on their behalf. 

We believe this work will have applicability in other domains of protecting 
community rights — such as key escrow and recovery in a university PKI. We 
discuss this further later in the paper. 

This Paper. Section 2 describes in some detail the motivations behind the Packet 
Vault and our Armored Vault. Later sections describe our design and our 
implementation of the prototype, discuss the prototype implementation, and 
consider the wider applicability of our design and what we plan to do next. 

Code. Our implementation is available for download^1, as is the developer’s 
toolkit for the IBM 4758 platform^2. 

2 Background 

In this section we examine some developments motivating our prototype’s sub- 
ject matter of comprehensive network traffic archival. Then we describe the 
Packet Vault (the original archival tool on which we base our prototype) and 
the reasons for seeking to armor such a vault. Finally we list some other works 
relevant to our topic. 

2.1 Evidence Collection 

Government authorities on both sides of the Atlantic are keen to get their hands 
on network traffic. In the US, this is done by the FBI with the Carnivore tool 
(renamed DCS1000). DCS1000 is a software system designed to run at an ISP 
and collect communications of the parties under surveillance [?]. As part 
of provisions for combating terrorism, European Union governments are seeking 
to revise the EU Directive on data protection and privacy of 1997^3 to allow for 
retention of telecommunications data in cases of national security significance, 
apparently without requirements for case-by-case court authorization and for 
minimal targeting of specific suspects [?]. 

^1 http://www.cs.dartmouth.edu/~pkilab/code/vault.tar.gz 
^2 http://www.alphaworks.ibm.com/tech/4758toolkit 
^3 Directive 95/46/EC 

2.2 Complete Archival of Net Traffic: The Packet Vault 

Storing all network traffic is on the agenda, and warrants some attention. In 
addition to increased surveillance power and convenience for law enforcement, 
traffic could be stored for network intrusion evidence, or authorized research. 

The Packet Vault is a pioneering device to address the plentiful security 
questions posed by storing a complete record of all network traffic [2]. The follow-on 
Advanced Packet Vault can keep up with 100 Mbit Ethernet. A design 
goal is to store packets securely, so that they may be accessed only through 
the security mechanism imposed by the vault: in an archive (on a CD-ROM) 
host-to-host conversations are encrypted with separate secret keys, and the set 
of these keys is encrypted using a public key algorithm with the private key held 
by a trusted entity, the vault owner (a person). Access to the archived data is 
accomplished by getting the owner to decrypt the secret keys for the desired 
conversations, which they presumably do once they decide all access conditions 
are satisfied. 
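
The key organization described above can be illustrated with a minimal sketch. It only shows how packets might be grouped into host-to-host conversations with one secret key per conversation; the encryption itself and the wrapping of the key set under the owner's public key are omitted. The function names, the packet tuple layout, and the 24-byte key size are assumptions made for illustration and are not taken from the Packet Vault.

```python
import secrets
from collections import defaultdict


def conversation_id(src, dst):
    """Treat A->B and B->A as the same host-to-host conversation."""
    return tuple(sorted((src, dst)))


def assign_conversation_keys(packets):
    """Group (src, dst, payload) packets by conversation and draw one key each.

    Returns (groups, keys): payloads per conversation, and a fresh random
    secret key per conversation (key size is a hypothetical choice).
    """
    groups = defaultdict(list)
    for src, dst, payload in packets:
        groups[conversation_id(src, dst)].append(payload)
    keys = {conv: secrets.token_bytes(24) for conv in groups}
    return groups, keys


packets = [
    ("10.0.0.1", "10.0.0.2", b"hello"),
    ("10.0.0.2", "10.0.0.1", b"reply"),
    ("10.0.0.1", "10.0.0.3", b"other"),
]
groups, keys = assign_conversation_keys(packets)
```

Because a key covers a whole conversation, releasing a key releases every packet in that conversation, which is the granularity limit discussed under the weaknesses of this approach.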



Weaknesses. The Packet Vault approach to securing archives leaves some important 
security questions open: 

Vulnerability to insider attack. The Vault depends on the trustworthiness of the 
vault owners. If they are law-enforcement personnel, for example, trusting a complete 
record of network traffic to them may be objectionable to many users. It 
is not very different from letting them have a clear-text record of the net traffic, 
and trusting them to use it as all concerned parties (whose data is stored) would 
like. Even if the owners act in good faith, their private key may be compromised, 
which could either expose all the archives on which the public key of the pair 
was used, or necessitate those archives’ destruction. 

Access flexibility. Accessing the archived packets is not very flexible. Full access 
or no access to some set of conversations can be granted to a requester; nothing 
else is possible. Some other useful access possibilities include: 

— Accessing data at finer granularity than a conversation. Rounding to the 
nearest conversation could either skip needed data, or produce excess data 
to the extent of making the released data unsuitable for evidentiary use^4. 
Access to computation seems to provide more power in selecting packets 
than a purely cryptographic scheme would. 

— Postprocessing before output. Such a capability could be useful in a lot of 
applications — deciding the presence or absence of something in the archive 
without revealing additional details, anonymizing data for experiments (e.g. 
blanked source headers), or producing statistics of the net traffic. 

It is not clear how some simple cryptographic extension to the Packet 
Vault could provide this particular feature. By definition, computation must 
be done on packets which must not themselves be seen in their original 
form. Proving that filtered data correctly follows from genuine archive data, 
without disclosing the archive data, appears difficult to solve in general using 
cryptography alone. 

^4 Wiretap authorizations often include a minimization requirement: data collected is 
to be strictly limited to that needed for the investigation. 

2.3 Related Previous Work 

These are some works related to policy expression and compliance checking, 
the use of secure coprocessors for controlling access to data, and distribution of 
sensitive data with attached access policy. 

An early comprehensive and general-purpose system for policy specification 
and checking was PolicyMaker. It has since evolved into KeyNote. The 
SPKI system deals with authorization in a distributed environment. The 
Trust Policy Language (TPL) is another suggested approach to policy 
expression and checking. A system such as these would be appropriate to plug 
into a more complete version of our prototype to provide a usable mechanism to 
specify and check access policy. 

Policy-carrying, policy-enforcing digital objects apply enforcement of computational 
policy to data. They use language-based control (inlined reference 
monitors for Java) to enforce compliance. 

Gennaro et al. suggest the use of secure hardware to provide high assurance 
against misuse and the possibility of an audit trail for their key recovery 
system. 

Yee et al. describe a use of secure coprocessors in controlling access to 
data. Their work covers protection of executable code, by encrypting the code such 
that only the designated coprocessor can access it. There is no discussion of 
access controls finer than “only coprocessor X can run this code”. Also, the 
coprocessors used^5 were hand-built and not commercially available. 

Our work provides an interesting counterpoint to conclusions that controlling 
access to data by a large population is inherently doomed. In our scenario, a 
large population controls access to their data by a small group, and because the 
access points are few, they can be controlled by high-end devices which provide 
high assurance. 

3 Prototype Design 

3.1 Overview 

All the interesting features mentioned in Sect. 2.2 (flexible packet selection, 
post-processing) could be provided by trusting the vault owners to perform them, for 
example by performing calculations on the stored packets and giving us the results. 
Trusting here refers not just to the integrity of the owners, but also to the integrity 
of the machines on which they perform their computations. 

We make use of secure hardware to be the “trusted party” in the archival 
system, replacing the vault owners previously described. Call this trusted party 
Solomon for the duration of this section. Solomon will serve us as follows: 

^5 Dyad, utilizing IBM Citadel hardware and a Mach 3.0 microkernel 

Fig. 1. Overview of armored vault design. Details of the cryptographic organization 
are described in Sect. 3.4. 



— He possesses an encryption key-pair and a signing key-pair, a description of 
how to authorize requests for data, and a description of how data is to be 
selected and processed before being given to a requester. 

— The stored traffic is all encrypted with Solomon’s encryption key. If we do 
not wish to trust someone to do this part, we could get Solomon to do it 
too. 

— Requests for access to the stored data are given to Solomon, who can then 
(1) evaluate if the request is worthy of being honored, (2) compute what 
data is to be released and (3) perform whatever computation is needed on 
that data to produce the final result, which he then signs and releases. 

Concretely, we use two secure coprocessors — one for collecting the net traffic 
and producing archives, and the other for arbitrating access to the archives. We 
refer to them as the Encoder and Decoder respectively. We use two because their 
tasks could be separated by space and time, and also the job of the Encoder is 
high-load and continuous. 

The Encoder encrypts the stored traffic using the Decoder’s encryption key, 
and signs the archive using its own signing key. The Decoder must decide if 
access requests against archives are authorized, decrypt the archive internally 
(no one outside can observe it decrypted) in order to perform selection of the 
desired data, compute the query result using the selected data and sign this final 
result. Fig. 1 shows an overview of the archive generation and access procedures. 
3.2 The Secure Hardware 

We use the IBM 4758 model 2 programmable secure coprocessor^6. Some 
properties of the 4758 are: 

— It can be programmed in C, with full access to an embedded Operating 
System (CP/Q++ from IBM) and hardware-accelerated cryptographic func- 
tionality. Programs run on a 99-MHz Intel 486 processor, with 4 MB of main 
memory on the 002 model. 

— With high assurance, it can carry out computation without possibility of 
being observed or surreptitiously modified^7. 

— It can prove that some given data was produced by an uncompromised pro- 
gram running inside a coprocessor. This is done by signing the data in ques- 
tion with a private key which only the uncompromised program can know. 
Details of this process follow. 

The coprocessors are realized as PCI cards attached to a host workstation. 

Outbound Authentication. The 4758 provides a mechanism for proving that output 
alleged to have come from a coprocessor application actually came from an 
uncompromised instance of that application. This is achieved by signing 
the output with an Application key-pair, generated for the application by 
the coprocessor. This keypair is bound not only to the application, but also to 
the underlying software configuration (OS and boot software)^8. Along with the 
keypair the application provides a certificate containing the public key of the 
signing pair, and a certificate chain attesting to this certificate. In this chain is 
the identity of the application too. The chain has at its root a publicly available 
IBM Root certificate. 

3.3 Access Policy 

We gather all the information relating to how archives are to be accessed into 
an access policy. The policy is thus the central piece of the armored vault. The 
Decoder will give access to the archive only in accordance with the access policy, 
and no one can extract any more information, since the design of the coprocessor 
precludes circumventing its programmed behavior. 

To represent access policies we use a table, whose rows represent different 
entry points into the data, and whose columns represent the parameters of each 
entry point. Anyone wanting access must select which entry point to use, and 
then satisfy the requirements associated with it. The parameters associated with 
entry points are: 

Request template. A template of a query selecting the desired data, with 
parameters to be filled by a particular access request. We call the format 
of the query the data selection language. An example could be “All email 
to/from email address X”. 

Macro limits. Limits on various properties of the result of the query. These 
could be things like total number of packets or bytes in the result, how many 
hosts are involved in the matching data, or packet rate with respect to time 
at the time of archival. 

Authorization. The authorization requirements for this entry point. These 
could be something like “Request must be signed by two district judges”. 

Post-processing. A filter which defines how the Decoder will compute a final 
result for the query based on the initial data selected. Some examples could 
be “Scrub all IP addresses”, or a statistical analysis of the data. 

^6 Note that the attack by Bond and Anderson penetrated a version of the CCA 
software running on the 4758, but not the security provisions of the device itself. 
Those remain secure. 

^7 The 4758 platform was the first device to receive FIPS 140-1 Level 4 validation 
[16,23]. 

^8 Bound in the sense that the keypair is destroyed if any of the associated software is 
changed significantly enough. 

The procedure for requesting data from the archive will be to indicate an 
entry point, and provide parameters for its request template. The request will 
contain the authorization data needed to satisfy the requirements of the chosen 
entry point; for example, it must be signed by all the parties who need to 
authorize the access. 
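
One way to picture the policy table is as a mapping from row numbers to entry points, each carrying the four parameters listed above. The sketch below is a hypothetical data structure, not the prototype's representation: the class and function names are invented, authorization is reduced to comparing signer names (a real Decoder would verify signatures), and only a packet-count macro limit is modeled.

```python
from dataclasses import dataclass
from string import Template
from typing import Callable, List, Optional


@dataclass
class EntryPoint:
    request_template: Template       # data selection query with $parameters
    max_packets: int                 # one macro limit; others are possible
    required_signers: frozenset      # authorization requirement (names only)
    post_process: Optional[Callable[[List[bytes]], List[bytes]]] = None

    def within_limits(self, matched_packets):
        """Macro limit check on the size of the query result."""
        return matched_packets < self.max_packets


def admit(policy, row, params, signers):
    """Return the instantiated query if the request satisfies the entry point."""
    entry = policy[row]
    if not entry.required_signers <= set(signers):
        raise PermissionError("missing required authorization")
    return entry.request_template.substitute(params)


policy = {
    1: EntryPoint(
        request_template=Template("all email to/from $address"),
        max_packets=100,
        required_signers=frozenset({"judge_a", "judge_b"}),
    )
}
query = admit(policy, 1, {"address": "x@example.org"}, ["judge_a", "judge_b"])
```

A request names a row, supplies the template parameters, and carries its authorization data; anything not expressible through some row is simply not obtainable.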

The actual policy has to balance the opposing motivations of (1) allowing 
all acceptable queries, as decided at the time of archival, and (2) ensuring that 
there is no way for anyone, even rogue insiders, to gain access to more of the 
data than was intended. 

3.4 Cryptographic Organization 

The top-level cryptographic organization of the armored vault is as shown in 
Fig. 1. Here we describe the details of each step. 

The Encoder must be initialized by giving it the public encryption key of the 
Decoder which will control access to the archives produced by this Encoder. The 
key is contained in an Application Certificate, part of a chain as mentioned in 
Sect. 3.2. Note that the Encoder can determine if the alleged Decoder is genuine 
from the application identity in the certificate chain. 

The Encoder produces an archive which is structured as shown in Fig. 2. Encryption 
of the stored packets is two-level: first the packets are encrypted with 
a TDES session key, and then the session key is encrypted with the encryption 
key of the Decoder. The Encoder provides verification for the archive by signing 
it and attaching its application certificate attesting to the signature. 

When the Decoder receives a request against an archive, it verifies the archive 
by checking the signature and the identity and supporting chain of the Encoder 
which produced the archive. It then decrypts the session key using its private 
decryption key, decrypts the packet dump using the session key, and carries on 
with processing the request. When the final result is computed, it signs it in the 
same way that the Encoder signs an archive. 
Fig. 2. Structure of the archive produced by the encoder. 



4 Implementation 

4.1 Overview 

Our setup consists of a Linux PC acting as the host to both the Encoder and 
Decoder cards. We picked Linux because we prefer open systems, and the Linux 
driver for the 4758 is open-source. 

The user interface to the prototype vault consists of a command-line program 
running on the host, which performs two tasks with the Decoder: 

— Request its public encryption key/certificate together with a certificate chain 
(see Sect. 3.2) attesting to the key, and save these in a file. For now the 
certificates are in the format used internally by the 4758. 

— Run a request for data against an archive previously made by the Encoder. 

and two tasks with the Encoder: 

— Set the encryption key and supporting certificate chain of this Encoder’s 
partner Decoder, using the file generated by the Decoder above. 

— Produce an archive given a libpcap-format packet dump. This can only be 
done after a partner Decoder is established. 

Packet dumps can be produced by any packet sniffer program which uses 
libpcap as its packet-capture mechanism and allows binary packet dumps to be 
produced. Two possibilities are the quintessential tcpdump, and snort. 

4.2 Encoder Operation 

The Encoder takes a libpcap-format packet dump and produces an archive 
structured as shown in Fig. 2. It stores the access policy internally, and attaches 
it to every archive. It performs everything described in our design, but being an 
early prototype is limited to processing only as much data as will fit into the 
coprocessor at once (about 1.7 MB with our current prototype). 

4.3 The Data Selection Language 

We use an existing package, Snort, version 1.7, to provide the packet selection 
capability in our demo. Snort is a libpcap-based Network Intrusion-Detection 
System (NIDS) which can select packets using the Berkeley Packet 
Filter (BPF) language as well as its own rule system, which features selection 
by packet content as well as by header fields. This rule language is our data 
selection language. The Snort rule system is described in detail at 
http://www.snort.org/writing_snort_rules.htm 

We chose Snort as it is an Open Source tool in active development, and active 
use. Important features are IP defragmentation, the capability to select packets 
by content, and a developing TCP stream re-assembly capability. 

Porting Snort. We compiled a subset of Snort (essentially the packet detection 
system) to run inside the Decoder card and interpret requests for data. This 
required implementations for non-STDC functions which were still needed for 
packet detection and processing, like inet_ntoa. The major challenge was implementing 
the stdio functions to enable the transfer of data to and from Snort 
when it ran inside the Decoder. CP/Q++, the current 4758 production OS, does 
not provide filesystem emulation, so we wrote implementations for a few of the 
POSIX filesystem calls (open, write, etc.) to operate with memory buffers inside 
the card. 
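
The idea of backing file calls with memory buffers can be shown with a small sketch. The prototype's implementation was written in C for CP/Q++ and is not reproduced here; the class below is a hypothetical Python stand-in whose names and behavior are assumptions, meant only to show descriptors mapping onto named in-memory buffers.

```python
class MemFS:
    """Minimal in-memory stand-in for open/write/read/close on named buffers."""

    def __init__(self):
        self._files = {}     # name -> bytearray holding the file contents
        self._open = {}      # fd -> [name, current position]
        self._next_fd = 3    # 0-2 are conventionally reserved for stdio

    def open(self, name, create=False):
        if name not in self._files:
            if not create:
                raise FileNotFoundError(name)
            self._files[name] = bytearray()
        fd = self._next_fd
        self._next_fd += 1
        self._open[fd] = [name, 0]
        return fd

    def write(self, fd, data):
        name, pos = self._open[fd]
        buf = self._files[name]
        buf[pos:pos + len(data)] = data          # overwrite or extend in place
        self._open[fd][1] = pos + len(data)
        return len(data)

    def read(self, fd, count):
        name, pos = self._open[fd]
        chunk = bytes(self._files[name][pos:pos + count])
        self._open[fd][1] = pos + len(chunk)
        return chunk

    def close(self, fd):
        del self._open[fd]


fs = MemFS()
fd = fs.open("rules", create=True)
fs.write(fd, b"log tcp any 80 -> any any\n")
fs.close(fd)

fd = fs.open("rules")
data = fs.read(fd, 1024)
fs.close(fd)
```

With such a layer in place, code written against file descriptors can exchange data with the host through buffers without knowing that no real filesystem exists.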

Snort Rules. Snort rules specify what packets Snort selects for further processing 
(like logging or alerts). They select packets based on packet headers or packet 
data. A simple example to select TCP packets from the http port for logging is 

    log tcp any 80 -> any any 

This only performs matching on packet headers. It could be read as “log TCP 
packets coming from any host, port 80, going to any host, any port”. A fancier 
example using content matching to produce an alert on noticing a potential 
attack (whose signature is the hex bytes 90C8 C0FF FFFF) is 

    alert tcp any any -> 192.168.1.0/24 143 
      (content: "|90C8 C0FF FFFF|/bin/sh"; msg: "IMAP buffer overflow!";) 
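
To make the content option concrete, the following sketch converts a content string of the kind shown above into the byte sequence it denotes, reading text between pipe characters as hex and the rest as ASCII. It is an illustration only, not Snort's own parser: the function names are invented, and escaped pipe characters and other rule options are not handled.

```python
def parse_content(pattern):
    """Convert a Snort-style content string to bytes.

    Segments between '|' characters are read as hex bytes; all other
    segments are read as ASCII text. Escaped pipes are not handled.
    """
    out = bytearray()
    for i, part in enumerate(pattern.split("|")):
        if i % 2 == 1:                       # odd segments lie between pipes
            out += bytes.fromhex(part)       # spaces between bytes are allowed
        else:
            out += part.encode("ascii")
    return bytes(out)


def matches(payload, pattern):
    """True if the payload contains the byte sequence named by the pattern."""
    return parse_content(pattern) in payload


signature = "|90C8 C0FF FFFF|/bin/sh"
needle = parse_content(signature)
```

Selection by payload bytes like this is what a scheme based only on per-conversation keys cannot offer, since it requires computing over the decrypted packets.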



4.4 Decoder Operation 

The Decoder implements our design (limited to small archives which can fit in 
the coprocessor at once), with the following exceptions: 

— No post-processing of packets is currently possible. This will be a bit of work 
to implement fully, with reasonable capabilities, so we did not attack it in 
this prototype. 

— It has no authorization capabilities, except simple macro limits. 

The detailed Decoder operation is shown in Fig. 3. 
[Figure: Archive, Request, Output] 

Fig. 3. Decoder operation. The Snort rule used to run Snort is made by instantiating 
the template in the policy using values from the request. 



4.5 Access Policy 

We implement the access policy as XML-format text, with a row element for 
each entry point (a row in the policy table), inside which are XML elements for 
all the entry point fields, with the exception of post-processing, which we have 
not implemented yet. We currently use very quick and simple “parsing” of the 
table, with an eye to using the expat XML parser later. The table used in the current 
version of the prototype is shown in Fig. 4. 



<?xml version="1.0"?> 
<policytable> 
  <title> 
    Experimental table 
  </title> 
  <row> 
    <reqtemplate> 
      log tcp any $port -> any any (content: "sasho") 
    </reqtemplate> 
    <macrolimits> 
      $total_packets < 100 
    </macrolimits> 
  </row> 
</policytable> 

Fig. 4. Current prototype policy. This policy allows the selection of TCP packets 
containing the word “sasho”, and coming from some port specified in the request (e.g. 
“port=80”). If more than 100 packets match, the request will be declined. 

4.6 Request Structure 

Requests consist of a set of name=value assignments, one for the row number 
from the policy table through which the request is going, and the rest assignments 
for the request template of the chosen row. An example which could be used with 
our sample policy in Fig. 4 is 

    row=1; port=80 
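
The path from a request to a Snort rule can be sketched as follows. This is a hedged illustration rather than the prototype's code: it reads the template parameter in the sample policy as `$port` (an assumption about the original figure), stores the policy as a Python dictionary instead of XML, and uses invented function names. Input validation, which a real Decoder would need so that a parameter value cannot smuggle in extra rule text, is omitted.

```python
from string import Template

# Hypothetical in-memory form of the sample policy (one row).
POLICY = {
    1: {
        "reqtemplate": 'log tcp any $port -> any any (content: "sasho")',
        "max_total_packets": 100,    # stands for "$total_packets < 100"
    }
}


def parse_request(text):
    """Parse 'name=value; name=value' into a dict of strings."""
    fields = {}
    for part in text.split(";"):
        if part.strip():
            name, value = part.split("=", 1)
            fields[name.strip()] = value.strip()
    return fields


def build_rule(request_text):
    """Pick the policy row named in the request and fill in its template."""
    fields = parse_request(request_text)
    row = POLICY[int(fields.pop("row"))]
    rule = Template(row["reqtemplate"]).substitute(fields)
    return rule, row["max_total_packets"]


def within_limit(total_packets, limit):
    """Macro limit check applied to the number of matching packets."""
    return total_packets < limit


rule, limit = build_rule("row=1; port=80")
```

Under these assumptions the request yields a rule matching TCP packets from port 80 that contain “sasho”, and a result of 105 matching packets would be declined by the limit check.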

5 Discussion 

5.1 Our Implementation 

Snort. This was one of the successful aspects of this prototype. Snort is a very 
capable packet detection engine, and it runs happily inside the secure coproces- 
sor. It will enable us to extend our policy capabilities considerably, especially 
when we start to consider application-level data selection, and reassembled TCP 
streams become important. 

Policy. Our current prototype policy (in Fig. 4) is fairly basic, but it does 
demonstrate many key points about the armored vault: 

— Selection of packets can be computation-based. It appears difficult to select 
packets by content using differentiating cryptography^9 alone. 

— Authorization decisions can be computation-based, and secure since they are 
running inside secure hardware. In this case, even in the absence of a PKI to 
perform full authorization, an authorization decision can be made based on 
the number of packets in the result. Since computation must be performed 
to calculate arbitrary properties of the matching data set, this cannot be 
done securely without using secure hardware. 

^9 Meaning that different sections of the stored archive are encrypted with different 
keys. 

Performance. High performance was not one of the aims of our prototype, but 
we include some figures for completeness. The Encoder could process a 1.6 MB 
packet dump to produce an archive in 6 seconds. A 630K dump took 2.3 seconds. 
A 2.0 MB dump failed due to insufficient memory. A Decoder run (without 
restrictions on packet number returned) on the 630K archive (1000 packets) 
which selected 105 packets took 6.3 sec. 

Future optimizations will clearly have to focus on the Encoder, which in 
practice will have to keep up with a fast network. 

5.2 Relation to the Advanced Packet Vault 

One of the primary concerns of the Advanced Packet Vault project from Michigan 
is speed: to enable the vault to keep up with a 100 Mb network at high 
load. The major areas of concern are system questions of keeping packets flowing 
ing into their final long-term storage as fast as they are picked up. Since we do 
not, nor do we intend to, consider questions of system infrastructure, our work 
complements the Advanced Vault well. The 4758 Model 2/23 secure coprocessor 
can perform TDES on bulk data at about 20 MB per second, which is sufficient 
to keep up with the Advanced Vault system infrastructure on a 100 Mb network. 
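
The figures quoted in Sect. 5.1 and above can be put side by side with a short calculation. The sketch assumes decimal units (1 MB = 10^6 bytes, 630K = 630,000 bytes), which the text does not specify, and compares raw rates only; it says nothing about protocol overhead or host-card transfer costs.

```python
MB = 1_000_000  # assumption: decimal megabytes


def rate_mb_per_s(num_bytes, seconds):
    """Throughput in MB/s for a given amount of data and elapsed time."""
    return num_bytes / seconds / MB


encoder_large = rate_mb_per_s(1.6 * MB, 6.0)    # 1.6 MB dump in 6 s
encoder_small = rate_mb_per_s(630_000, 2.3)     # 630K dump in 2.3 s
line_rate = 100e6 / 8 / MB                      # 100 Mbit/s expressed in MB/s
tdes_rate = 20.0                                # quoted bulk TDES rate, MB/s

shortfall = line_rate / encoder_large           # factor the prototype is behind
```

On these assumptions the prototype Encoder handles roughly 0.27 MB/s against a line rate of 12.5 MB/s, while the quoted TDES rate of 20 MB/s exceeds the line rate, which is consistent with the remark that optimization effort belongs in the Encoder rather than in the cipher hardware.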

6 Conclusions and Future Work 

6.1 Feasibility 

We had several goals for our initial experiments. The community had long spec- 
ulated on the use of secure coprocessors for computational enforcement of rights 
management. We carried this speculation one step further, by completing an 
implementation on a commercial platform, that could (in theory) be deployed 
on a wider scale with no additional technological infrastructure. 

One of the purported advantages of the coprocessor approach to secure data 
processing is the ability to insert a computational barrier between the end user 
and the raw data. For this advantage to be realizable, however, this barrier must 
be able to support useful filter computation. By porting the Snort package inside 
the coprocessor environment, we experimentally verified this potential, for the 
application domain of archived network data. 

6.2 Future Steps for Network Vault 

We plan further work both within the application domain of archived network 
data, as well as in other domains. 

Within this domain, we immediately plan to expand our prototype along these 
major directions: 

Performance. Currently, the prototype limits the size of requests and archives to 
what can fit inside the coprocessor at one time. This limitation can be overcome 
to enable arbitrary-sized packet dumps and archives to be passed into the vault 
coprocessors. The upcoming release of Linux for the 4758 (as a replacement for 
CP/Q++) will make this easier, as Linux will provide more abstract host-card 
communication services, which will be useful in implementing a continuous data 
flow through the vault. When this is done, archive size will be decided by the 
long-term storage medium, like CD-ROM^10. 

We plan to attach the Encoder card to a live packet source (sniffer) to enable 
live data collection. 

Policy. Currently, the capabilities of the policy mechanism are very limited. The 
query language (Snort rules) is very flexible, but the macro limits and authorization 
procedures are very bland. We plan to implement limits on more parameters 
like packet quantities/rates and number of hosts involved. On the authorization 
side, the first step would be to implement a history service: keeping a log of 
accesses to enable, for example, a requirement like “packets can only be decrypted 
once”. A full authentication of requests would depend on some authorization 
infrastructure, like those mentioned in Sect. 2.3. For post-processing, we will 
implement scrubbing functions to erase any packet parameters required. 

Longer-term issues include setting up a policy administration system (for 
non-specialist managers), ensuring that access policy is consistent with overall 
goals, and exploring ways to allow access policy to existing archives to change, 
without violating existing policy promises. 

Systems for authorization sooner or later assume that specific individuals 
or roles wield exclusive control over specific private keys. A serious part of a 
fully viable data vault would be a deployable infrastructure that makes this 
assumption true with reasonably high assurance. Ongoing campus PKI work at 
Dartmouth could be very useful here. 

As an early sketch of this project considered, the binding of an archive 
to a single Decoder leaves the door open to denial-of-service attacks through 
Decoder destruction (just a tamper attempt will do the job). This binding is the 
current way of ensuring that the archive is accessed only through its attached 
policy, and there are no easy alternatives for achieving this assurance. Some false 
starts are secret sharing to back up the Decoder private key among some number 
of parties, which leaves the possibility of collusion for illegal access; and backing 
up the key in other 4758s, which could still be compromised, or conceivably 
fail systematically, losing the key. Investigating the backup question will be 
important in making the armored vault practically acceptable. 

6.3 Future Application Domains 

The general framework of storing network packets securely and with an attached 
access policy, and using secure coprocessors to enforce this access policy, applies 
to access rules for any large data set. An application we are actively exploring 
is that of an Academic PKI. 

^10 A fully general solution to this problem broaches the issue of private information 
retrieval, since the user (who wants access to the data) may learn unauthorized 
things by observing how the coprocessor accesses the large archive. In related work, 
we have examined the use of coprocessors for practical solutions to this problem. 

Here in the Dartmouth PKI Lab, we have been interested in deploying an 
academic PKI that has at least some assurances that private keys remain private. 

This domain raises many areas where we need to balance community interest 
against privacy interests. For example: 

— The community may decide that certain administrative or law enforcement 
scenarios may require access to data that students have encrypted with their 
keys. 

— Students may lose the authenticators (e.g. passphrases, or smart cards) that 
guard access to their private keys; one preliminary study showed this was 
very common in segments of our undergraduate population. 

— Students may encrypt data in the role of an organization officer — but then 
may leave before duly delegating access rights to their successor. 

All of these scenarios require selective weakening of the protections of cryptog- 
raphy, should the community as a whole decide it is justified. Unlike standard 
approaches to key escrow and recovery, our armored vault approach permits 
tuning of access to exactly the community standard — and also provides defense 
against insider attack, since (if the policy and authorization are sound, and the 
Level 4 validation is meaningful), neither bribery nor subpoena can enable the 
vault operator to go beyond the pre-defined access rules. 
Abstract. The random data perturbation (RDP) method is often used in 
statistical databases to prevent inference of sensitive information about 
individuals from legitimate sum queries. In this paper, we study the RDP 
method for preventing an important type of inference: interval-based in- 
ference. In terms of interval-based inference, the sensitive information 
about individuals is said to be compromised if an accurate enough in- 
terval, called inference interval, is obtained into which the value of the 
sensitive information must fall. We show that the RDP methods proposed 
in the literature are not effective for preventing such interval-based in- 
ference. Based on a new type of random distribution, called e-Gaussian 
distribution, we propose a new RDP method to guarantee no interval- 
based inference. 



1 Introduction 



Conflicts exist between individual rights to privacy and society’s needs to know 
and process information [3]. In many applications, from census data through 
statistical databases to data mining and data warehousing, only aggregated 
information is allowed to be released while sensitive (private) information about 
individuals must be protected. However, an inference problem exists because 
sensitive information could be inferred from the results of aggregation queries. 

Depending on what is exactly inferred from the queries about aggregated 
information, various types of inference problem have been identified and stud- 
ied. See surveys [[ 3121014 ] . Most existing works deal with either the inference of 
exact values, referred to as exact inference (or exact disclosure), or statistical 
estimators, referred to as statistical inference (or partial disclosure). 

We study another type of inference. Consider a relation with attributes 
(model, sale) where attribute model is public and attribute sale is sensitive. 
Assume that a user issues two sum queries and that the correct answers are 
given as follows: (1) The summed sales of model A and model C are 200; (2) 
The summed sales of model A and model B are 4200. Clearly, from those two 
queries, the user cannot infer any exact value of sale. However, due to the fact 
that attribute sale is positive, first the user is able to determine from query (1) 
that both the sales of model A and the sales of model C are between zero and 
200, and then from query (2) that the sales of model B are between 4000 and 
4200. The length of the interval [4000, 4200], which is the maximum error of the user’s 
estimation of the sales of model B, is less than 5% of the actual sales of that model. 
This certainly can be regarded as a disclosure of sensitive information about 
model B. 
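
The reasoning in this example is plain interval arithmetic, which the following sketch reproduces. The function names are illustrative and not from the paper; the only inputs are the two query answers and the fact that sales cannot be negative.

```python
def bound_from_sum(total, lower=0):
    """If x + y == total and both are >= lower, each lies in [lower, total - lower]."""
    return (lower, total - lower)


def subtract_from_total(total, interval):
    """If x + y == total and x lies in [lo, hi], then y lies in [total - hi, total - lo]."""
    lo, hi = interval
    return (total - hi, total - lo)


a_interval = bound_from_sum(200)                      # query (1): A + C = 200
b_interval = subtract_from_total(4200, a_interval)    # query (2): A + B = 4200
width = b_interval[1] - b_interval[0]
relative_width = width / b_interval[0]                # width over smallest possible B
```

The interval for model B has width 200, and since B is at least 4000 the width is at most 5% of the true value, with equality excluded whenever model C has nonzero sales.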

The above example indicates that a database attacker is able to infer an 
accurate enough interval even when he or she cannot infer an exact value of the 
sensitive attribute. We call this type of inference interval-based inference and 
the interval the inference interval. We study methods to prevent interval-based 
inference. 

Interval-based inference poses a different threat to individual privacy in com- 
parison with exact inference and statistical inference. On one hand, exact infer- 
ence can be regarded as a special case of interval-based inference where the length 
of inference interval is equal to zero. This means that exact inference certainly 
leads to interval-based inference while interval-based inference may still occur 
in the absence of exact inference. On the other hand, the statistical inference is 
studied in the context that random perturbation, i.e., noise, is added to sensitive 
individual data. If the variance of the noise that is added is no less than a prede- 
termined threshold, then the perturbed database is considered “safe” in terms 
of statistical inference. Because the perturbation is probabilistic, it is possible - 
no matter how large the variance of the noise is - that some perturbed values 
are close enough to the unperturbed ones, leading to interval-based inference. 

Random data perturbation (RDP) methods (i.e., adding random noise to
sensitive data) have been studied for many years, especially in the statistical
database literature, to prevent both exact inference and statistical inference.
For numerical attributes, most RDP methods generate noise based on a Gaussian
distribution. Such a method is not effective in preventing interval-based
inference since arbitrarily small noise could be generated. To illustrate this,
assume that random noise from a Gaussian distribution with mean zero and
“large enough” variance $\sigma^2 = 10^4$ is added to each model's sales in
the above (model, sale) example. Due to the perturbation, it is impossible for
a database attacker to infer any exact value of a model's sales (i.e., exact
inference is well protected by random perturbation). The perturbed data is also
“safe” in terms of statistical inference since the variance is “large enough”.
However, interval-based inference may still occur with high probability: the
probability of a perturbed value being at a distance shorter than 100 from its
original value is 68.27%; and the probability rises to 99.99% if the distance
threshold increases from 100 to 400.
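
The two percentages can be checked with the standard error function. The following Python sketch is only an illustration; the helper name `prob_within` is an assumption, not something defined in the paper.

```python
import math


def prob_within(distance, sigma):
    """P(|noise| < distance) for zero-mean Gaussian noise with std. dev. sigma."""
    return math.erf(distance / (sigma * math.sqrt(2.0)))


sigma = math.sqrt(10**4)          # variance 10^4, so sigma = 100
p_100 = prob_within(100, sigma)   # one standard deviation
p_400 = prob_within(400, sigma)   # four standard deviations
```

Here `p_100` is roughly 0.6827 and `p_400` roughly 0.9999, in line with the figures quoted above.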

In this paper, we propose to use a new type of random distribution, called
$\epsilon$-Gaussian distribution, to generate “large enough” random noise in
data perturbation. Our RDP method is based on the $\epsilon$-Gaussian
distribution and is able to prevent not only exact and statistical inference
but also interval-based inference.

The contributions of this paper are as follows. First, we illustrate the
significance of studying interval-based inference. We show that traditional RDP
methods for preventing exact inference and statistical inference are not
effective for preventing such interval-based inference. Second, we propose a
new type of random distribution, the $\epsilon$-Gaussian distribution, for data
perturbation. The RDP method based on the $\epsilon$-Gaussian distribution is
effective for preventing interval-based inference. Third, we analyze the
trade-off among the magnitude of noise, query precision, and query set size
(i.e., the number of records in a query) in RDP.

The rest of the paper is organized as follows. In Section 2 we define
interval-based inference and review RDP methods. In Section 3 we propose the
$\epsilon$-Gaussian distribution and study the RDP method based on it. We
also discuss the RDP method based on a variant of the $\epsilon$-Gaussian
distribution, the $\epsilon$-$\delta$-Gaussian distribution. The paper
concludes with a brief summary.

2 Preliminaries 

In this section, we first define interval-based inference, then discuss the vulnera- 
bility of unperturbed data to interval-based inference, and introduce some basics 
about RDP. 

2.1 Interval-Based Inference 

Consider a sensitive attribute with $n$ individual real values
$x_1^*, \ldots, x_n^*$ stored in a database. Denote
$x^* = (x_1^* \ldots x_n^*)^T$. If the individual values are not perturbed,
they must be kept secret due to privacy concerns while some sum queries need to
be answered. A sum query (or query for short) is defined to be a subset of the
individual values in vector $x^*$, and the query response is the sum of the
values in the specified subset.

From the point of view of database users, each individual value $x_i^*$ is
secret and thus is a variable, denoted by $x_i$. Let
$x = (x_1 \ldots x_n)^T$. To model that database users or attackers may have
a priori knowledge about variable bounds, we assume that
$l_i \le x_i \le r_i$, where $l_i$, $r_i$ are the lower bound and upper bound
of $x_i$, respectively ($l_i$ and/or $r_i$ can be infinite). We use
$l \le x \le r$ to denote the boundary information
$\{l_i \le x_i \le r_i \mid 1 \le i \le n\}$.

With the above notation, each query with its response w.r.t. $x^*$ can be
expressed by a linear equation $a_i^T x = b_i$, where $a_i$ is an
$n$-dimensional vector whose components are either one or zero, and $b_i$ is a
scalar that equals $a_i^T x^*$. Similarly, a set of $m$ queries with responses
w.r.t. $x^*$ can be expressed by a system of linear equations $Ax = b$, where
$A$ is an $m \times n$ matrix whose elements are either one or zero, and $b$
is an $m$-dimensional vector that equals $Ax^*$.

Given a set of queries with responses w.r.t. $x^*$, $Ax = b$, and the boundary
information $l \le x \le r$, interval-based inference is defined as follows.

Definition 2.1. (Interval-based inference of $x_k$.) There exists an interval
$I_k = [x_k^{min}, x_k^{max}]$ such that (i) the length of the interval $I_k$,
$|I_k| = x_k^{max} - x_k^{min}$, satisfies $|I_k| < \epsilon_k$; (ii) in all
the solutions for $x$ that satisfy both $Ax = b$ and $l \le x \le r$, variable
$x_k$ has values in the interval $I_k$; (iii) $x_k^{min}$, $x_k^{max}$, and
$x_k^*$ are three of these values (in the interval $I_k$), where $\epsilon_k$
is a predefined threshold called tolerance level.

In the above definition, tolerance level $\epsilon_k$ (usually
$\epsilon_k \ll r_k - l_k$, $k = 1, \ldots, n$) is used to indicate whether
variable $x_k$ is compromised in terms of interval-based inference. We assume
that $\epsilon_1 = \ldots = \epsilon_n = \epsilon$ unless otherwise indicated.
Interval $I_k = [x_k^{min}, x_k^{max}]$ is called the k-th inference interval
(or inference interval for short). The endpoints $x_k^{min}$ and $x_k^{max}$ of
$I_k$ can be derived as the optimal objective values of the following two
mathematical programming problems (MPs), respectively:

$$
\begin{array}{llcll}
\text{minimize} & x_k & \qquad & \text{maximize} & x_k \\
\text{subject to} & Ax = b & & \text{subject to} & Ax = b \\
 & l \le x \le r & & & l \le x \le r
\end{array}
$$

Because the feasible set $P = \{x \mid Ax = b, l \le x \le r\}$ of the above
MPs is a nonempty bounded polyhedron (note that $x^*$ must be in $P$), the
values $x_k^{min}$ and $x_k^{max}$ must exist uniquely.

In the case of $\epsilon = 0$, variable $x_k$ has the same value $x_k^*$ in all
the solutions that satisfy $Ax = b$ and $l \le x \le r$. Therefore, exact
inference is a special case of interval-based inference where the tolerance
level $\epsilon$ is zero.



2.2 Vulnerability of Unperturbed Data to Interval-Based Inference

Database attackers can apply mature techniques in mathematical programming
to attain interval-based inference. Specifically, database attackers solve the
series of mathematical programming problems defined above to obtain the
inference interval lengths $|I_k| = x_k^{max} - x_k^{min}$ and determine
whether $|I_k| < \epsilon$ ($k = 1, \ldots, n$). Due to the fast development of
computing techniques, mathematical programming (i.e., linear programming)
problems like these can be routinely solved with hundreds or thousands of
variables and constraints, and a large number of free codes and commercial
software packages are currently available in this area. It is feasible for
database attackers to compromise sensitive information by using dedicated
computing resources.

However, it is usually difficult for a database system to apply the same
techniques to check for interval-based inference. To do so, the system must
record the entire history of queries for all users and compute inference
intervals in the same way as all possible database attackers do. In large
database systems with many users, especially in on-line environments, such an
auditing method is practically infeasible.

2.3 Random Data Perturbation (RDP) 

Researchers in statistical databases have developed random data perturbation
(RDP) methods to protect sensitive information about individuals from sum
queries. The basic idea is very simple: add random noise to each variable
and answer user queries according to the perturbed data. For numerical (real)
variables, this is expressed by $x_k = x_k^* + e_k$ ($k = 1, \ldots, n$), where
the random perturbations $e_k$ are assumed to be independently distributed with
the same mean zero and variance $\sigma^2$. The perturbation is done once and
for all, and all sum queries are computed as functions of the perturbed data
$x = (x_1 \ldots x_n)^T$.

Note that $x_k$ and $e_k$ are fixed after the perturbation. In the rest of this
paper, we use $x_k$ (or $e_k$) to represent both the variable before the
perturbation and the value after the perturbation if no confusion arises.

RDP is effective in preventing both exact inference and statistical inference.
On the one hand, exact inference is impossible due to the perturbation. On the
other hand, statistical inference is eliminated (in a statistical sense) if the
variance $\sigma^2$ of the noise is larger than a predefined threshold.

Since generation of random noise is relatively easy, RDP is suitable for large 
database systems with many users, even in on-line environments. Once the per- 
turbation is done, database users can issue any legitimate queries and obtain 
responses; thus, database usability and performance are maximized. The down- 
side of RDP is that query precision is degraded; query responses are not 100% 
correct. Another problem is the risk of introducing bias in query responses. We 
do not address the bias problem in this paper. The interested reader is referred
to the references for more details.

3 RDP for Preventing Interval-Based Inference 

We consider the RDP method that uses a Gaussian distribution with mean zero
and variance $\sigma^2$ to generate random noise. As discussed in the
introduction, this method cannot be used to prevent interval-based inference
since arbitrarily small noise could be generated and added to some variables,
leading to a high risk of interval-based inference. We propose to use a new
type of random distribution, called $\epsilon$-Gaussian distribution, to
generate random noise in data perturbation. Consequently, the RDP method based
on the $\epsilon$-Gaussian distribution is able to prevent not only exact and
statistical inference but also interval-based inference.

3.1 $\epsilon$-Gaussian Distribution

Given a tolerance level $\epsilon$, we define the $\epsilon$-Gaussian
distribution based on the Gaussian distribution. The cumulative distribution
function (CDF) of the $\epsilon$-Gaussian distribution is as follows (see
Figure 1(b)):

$$
F_\epsilon(t) = \begin{cases}
F(t) & \text{if } t < \mu - \epsilon \\
0.5 & \text{if } \mu - \epsilon \le t < \mu + \epsilon \\
F(t) & \text{if } t \ge \mu + \epsilon
\end{cases} \qquad (1)
$$

where $F(t) = 0.5 \cdot (\mathrm{erf}(\frac{t - \mu}{\sqrt{2}\sigma}) + 1)$ is
the CDF of the Gaussian distribution with mean $\mu$ and variance $\sigma^2$
(see Figure 1(a)). Here
$\mathrm{erf}(t) = \frac{2}{\sqrt{\pi}} \int_0^t e^{-s^2} ds$ is the error
function.

Fig. 1. CDF functions



1. generate a random number $t$ by Gaussian distribution with mean $\mu$ and
   variance $\sigma^2$;
2. if $t < \mu - \epsilon$ then return $t$;
3. if $t > \mu + \epsilon$ then return $t$;
4. if $\mu - \epsilon \le t < \mu$ then return $\mu - \epsilon$;
5. if $\mu < t \le \mu + \epsilon$ then return $\mu + \epsilon$;
6. if $t = \mu$ then return $\mu - \epsilon$ or $\mu + \epsilon$ with the same
   probability 0.5.

Fig. 2. Procedure to generate $\epsilon$-Gaussian distribution
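
The six steps of the procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `eps_gaussian`, the `rng` parameter, and the sample parameters (standard deviation 100, tolerance level 50) are all hypothetical.

```python
import random


def eps_gaussian(mu, sigma, eps, rng=random):
    """Draw one noise value following the six-step procedure of Fig. 2."""
    t = rng.gauss(mu, sigma)            # step 1: ordinary Gaussian draw
    if t < mu - eps or t > mu + eps:    # steps 2 and 3: large noise is kept
        return t
    if t < mu:                          # step 4: snap to the lower edge
        return mu - eps
    if t > mu:                          # step 5: snap to the upper edge
        return mu + eps
    # step 6: t == mu exactly, pick either edge with probability 0.5
    return mu - eps if rng.random() < 0.5 else mu + eps


rng = random.Random(0)                  # fixed seed, for repeatability only
noise = [eps_gaussian(0.0, 100.0, 50.0, rng) for _ in range(10_000)]
```

Every value in `noise` lies at distance at least 50 from the mean, which is the property the perturbation method relies on.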



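The $\epsilon$-Gaussian distribution can be generated by the procedure shown in
Figure 2, based on a Gaussian distribution that has mean $\mu$ and variance
$\sigma^2$. Denote by $\mu_\epsilon$ the mean of the $\epsilon$-Gaussian
distribution and by $\sigma_\epsilon^2$ its variance. We have

$$\mu_\epsilon = \mu \qquad (2)$$

$$
\sigma_\epsilon^2
= \sigma^2 + \int_{\mu-\epsilon}^{\mu+\epsilon} (\epsilon^2 - (t-\mu)^2) \cdot f(t)\, dt
= \sigma^2 + (\epsilon^2 - \sigma^2) \cdot \mathrm{erf}\left(\frac{\epsilon}{\sqrt{2}\sigma}\right)
+ \sqrt{\frac{2}{\pi}}\, \sigma \epsilon \cdot e^{-\frac{\epsilon^2}{2\sigma^2}} \qquad (3)
$$

where $f(t) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(t-\mu)^2}{2\sigma^2}}$ is
the probability density function of the Gaussian distribution with mean $\mu$
and variance $\sigma^2$.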

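Now we discuss the characteristics of the variance of the $\epsilon$-Gaussian
distribution. It is easy to see that $\sigma_\epsilon^2$ is a monotonically
increasing function of $\sigma^2$ while $\epsilon$ is fixed and that
$\lim_{\sigma^2 \to \infty} \sigma_\epsilon^2 = \infty$ and
$\lim_{\sigma^2 \to 0} \sigma_\epsilon^2 = \epsilon^2$. We have
$\epsilon^2 < \sigma_\epsilon^2 < \infty$, which means that $\epsilon^2$ is the
lower bound of the variance of the $\epsilon$-Gaussian distribution. On the
other hand, if $\sigma^2$ is fixed, $\sigma_\epsilon^2$ is a monotonically
increasing function of $\epsilon$, with
$\lim_{\epsilon \to \infty} \sigma_\epsilon^2 = \infty$ and
$\lim_{\epsilon \to 0} \sigma_\epsilon^2 = \sigma^2$. Therefore, the Gaussian
distribution can be considered a special case of the $\epsilon$-Gaussian
distribution where the tolerance level $\epsilon$ is zero.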
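
The closed form of the variance and its limiting behaviour can be checked numerically. The sketch below is an illustration, not part of the paper: it compares the closed-form expression with a midpoint-rule evaluation of the integral form, for a zero-mean distribution. Both function names and the test parameters are assumptions.

```python
import math


def eps_gaussian_variance(sigma, eps):
    """Closed-form variance of the eps-Gaussian distribution."""
    a = eps / (math.sqrt(2.0) * sigma)
    return (sigma**2 + (eps**2 - sigma**2) * math.erf(a)
            + math.sqrt(2.0 / math.pi) * sigma * eps * math.exp(-a * a))


def eps_gaussian_variance_numeric(sigma, eps, steps=20_000):
    """sigma^2 plus the integral of (eps^2 - u^2) f(u) over [-eps, eps]."""
    h = 2.0 * eps / steps
    total = 0.0
    for i in range(steps):
        u = -eps + (i + 0.5) * h        # midpoint of the i-th subinterval
        pdf = math.exp(-u * u / (2.0 * sigma**2)) / (math.sqrt(2.0 * math.pi) * sigma)
        total += (eps**2 - u * u) * pdf * h
    return sigma**2 + total


closed = eps_gaussian_variance(1.0, 2.0)
numeric = eps_gaussian_variance_numeric(1.0, 2.0)
```

The two values agree closely, the result stays above $\epsilon^2$, and letting `eps` shrink toward zero brings the variance back to that of the underlying Gaussian distribution.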

3.2 RDP Based on $\epsilon$-Gaussian Distribution

Consider the RDP method based on the $\epsilon$-Gaussian distribution. The
random perturbation is expressed by $x_k = x_k^* + e_k$ ($k = 1, \ldots, n$),
where the random noise $e_k$ is chosen from the $\epsilon$-Gaussian
distribution with mean zero and variance $\sigma_\epsilon^2$. The random noise
values are independent for different $k$'s. The definition of the
$\epsilon$-Gaussian distribution implies that

$$|x_k - x_k^*| \ge \epsilon, \quad k = 1, \ldots, n. \qquad (4)$$

Proposition 3.1. The RDP method based on the $\epsilon$-Gaussian distribution
guarantees no interval-based inference.

Proof. User queries are based on the perturbed data $x$ rather than the original
data $x^*$; therefore, inference interval $I_k$ must contain $x_k$. For a given
tolerance level $\epsilon$, from formula (4) we know that if
$|I_k| < \epsilon$, then $x_k^*$ must not belong to $I_k$; similarly, if
$x_k^* \in I_k$, then $|I_k| \ge \epsilon$. Therefore, according to
Definition 2.1, the RDP method based on the $\epsilon$-Gaussian distribution
guarantees that no interval-based inference exists. □

To compare the RDP method based on the Gaussian distribution (with mean
zero and variance $\sigma^2$) with the one based on the $\epsilon$-Gaussian
distribution (with mean zero and variance $\sigma_\epsilon^2$), we define the
probability of inference w.r.t. variable $x_k$, denoted by
$P\{|x_k - x_k^*| < \epsilon\}$, to be the probability that the perturbed value
$x_k$ is at a distance less than the tolerance level $\epsilon$ from its
original value $x_k^*$. For the RDP method based on the $\epsilon$-Gaussian
distribution, this probability is zero by Proposition 3.1. For the RDP method
based on the Gaussian distribution, however, the probability of inference is
$F(\epsilon) - F(-\epsilon) = \mathrm{erf}(\frac{\epsilon}{\sqrt{2}\sigma})$,
where $F(t)$ is the CDF of the Gaussian distribution. If $\epsilon = \sigma$,
then the probability is 68.27%; if $\epsilon = 4\sigma$, the probability of
inference rises to 99.99%. Such a high probability is unacceptable in practice
because a large percentage of the perturbed data is vulnerable to
interval-based inference.

There exists a trade-off among the magnitude of noise, query precision, and the
size of the query set in RDP. Given a query set $S$ that is a subset of
$\{1, \ldots, n\}$, the corresponding sum query and query response can be
expressed by $\sum_{k \in S} x_k = b$. Due to perturbation, the query response
$b$ is different from its true value $b^* = \sum_{k \in S} x_k^*$, where
$x_k^*$ is the original value of $x_k$. Denote by $|S|$ the size of query set
$S$.

Proposition 3.2. The RDP method based on the $\epsilon$-Gaussian distribution
implies that the response to a query $b = \sum_{k \in S} x_k$ satisfies

$$P\{|b - b^*| \ge \theta \cdot |S|\} \le \frac{\sigma_\epsilon^2}{\theta^2 |S|} \qquad (5)$$

where the left-hand side represents the probability that the error of the query
response exceeds some bound in terms of the size of the query set and a
parameter $\theta$; the right-hand side indicates that this probability is at
most a bound that can be computed from the variance of perturbation, the size
of the query set, and the parameter $\theta$.

Proof. Since random variable $b$ has mean $b^*$ and variance
$|S| \cdot \sigma_\epsilon^2$, formula (5) can be derived from Chebyshev's
inequality. □

Formula (5) can be used to analyze the trade-off among the variance of
perturbation, the size of the query set, and the error incurred by queries. On
the one hand, the larger the size of the query set, the smaller the probability
of a large error in query responses. On the other hand, the smaller the
variance of perturbation, the more precise the query responses. We also note
that the variance $\sigma_\epsilon^2$ cannot be arbitrarily small due to its
lower bound $\epsilon^2$.

Example. Assume that the average sales is 40,000 for a sensitive attribute sale
in a database. A database administrator decides the tolerance level to be 4000
(i.e., $\epsilon = 4000$) for interval-based inference. He or she also wishes
to control the error in query responses with a bound parameter 1000 (i.e.,
$\theta = 1000$). Further assume that the $\epsilon$-Gaussian distribution is
used in RDP and that the variance of the Gaussian distribution, which is used
to derive the $\epsilon$-Gaussian distribution, is 1000 (i.e.,
$\sigma^2 = 1000$). Formula (5) indicates that the error $|b - b^*|/|S|$
exceeds 1000 with a probability smaller than $17/|S|$. As a comparison, if the
Gaussian distribution is used in RDP instead of the $\epsilon$-Gaussian
distribution, to achieve the same level of query precision, the variance of the
Gaussian distribution must be $17 \times 10^6$ and the probability of inference
is 0.6680. This is practically unacceptable since 66.8% of the perturbed data
are vulnerable to interval-based inference, whereas this probability is zero
for the RDP based on the $\epsilon$-Gaussian distribution. □
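
The numbers in this example can be reproduced with a few lines of Python. The sketch is illustrative only; the helper names are assumptions, and the variance function simply evaluates the closed-form expression given for the variance of the distribution in Section 3.1.

```python
import math


def eps_gaussian_variance(sigma2, eps):
    """Closed-form variance of the noise, given the Gaussian variance sigma2."""
    sigma = math.sqrt(sigma2)
    a = eps / (math.sqrt(2.0) * sigma)
    return (sigma2 + (eps**2 - sigma2) * math.erf(a)
            + math.sqrt(2.0 / math.pi) * sigma * eps * math.exp(-a * a))


def chebyshev_factor(variance, theta):
    """Chebyshev bound of formula (5), multiplied by the query set size |S|."""
    return variance / theta**2


eps, theta, sigma2 = 4000.0, 1000.0, 1000.0
var_eps = eps_gaussian_variance(sigma2, eps)
factor = chebyshev_factor(var_eps, theta)      # bound is factor / |S|

gauss_variance = 17e6                          # plain Gaussian alternative
p_inference = math.erf(eps / math.sqrt(2.0 * gauss_variance))
```

Under these assumptions `factor` comes out at about 16, consistent with the bound of $17/|S|$ quoted in the example, and `p_inference` is about 0.668.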

Discussion. Formula (5) indicates that the error estimate for the query
response holds only in a statistical sense. If database users are not satisfied
with this, a correction scheme needs to be developed. Traub et al. proposed the
following remedy scheme: (i) monitor the error in each query and check whether
it exceeds a predetermined threshold; (ii) if the error threshold is exceeded
and if the query size is large enough, simply perturb the query response by
adding some random noise until the error of the query is within the error
threshold. It was pointed out that by doing so a statistical inference is
possible, but the cost can be made exponential in the size of the database. The
same scheme and argument are also applicable to the case of interval-based
inference. The reader is referred to that work for more details.



3.3 Variable Tolerance Level 

In the above discussions, we have assumed that the tolerance level is constant.
If we use a variable tolerance level (e.g., $\epsilon_k = r \cdot x_k^*$
($k = 1, \ldots, n$) where $r$ is a ratio), Proposition 3.2 should be revised
accordingly. Let $\sigma_{\epsilon_k}^2$ denote the variance of random noise
$e_k$ ($k = 1, \ldots, n$). The trade-off among the magnitude of noise, query
precision, and the size of the query set is described in the following
proposition:

Proposition 3.3. In the case of variable tolerance level, the RDP method based
on the $\epsilon$-Gaussian distribution implies that the response to a query
$b = \sum_{k \in S} x_k$ satisfies

$$P\{|b - b^*| \ge \theta \cdot |S|\} \le \frac{\sum_{k \in S} \sigma_{\epsilon_k}^2}{\theta^2 |S|^2} \qquad (6)$$

Proof. Since the random variable $b$ has mean $b^*$ and variance
$\sum_{k \in S} \sigma_{\epsilon_k}^2$, formula (6) can be derived from
Chebyshev's inequality. □

1. generate a random number $t$ by Gaussian distribution with mean $\mu$ and
   variance $\sigma^2$;
2. if $t < \mu - \epsilon$ then return $t$;
3. if $t > \mu + \epsilon$ then return $t$;
4. if $\mu - \epsilon \le t < \mu$ then return
   $\mu - \epsilon + \frac{\delta}{\epsilon}(t - \mu)$;
5. if $\mu < t \le \mu + \epsilon$ then return
   $\mu + \epsilon + \frac{\delta}{\epsilon}(t - \mu)$;
6. if $t = \mu$ then return $\mu - \epsilon$ or $\mu + \epsilon$ with the same
   probability 0.5.

Fig. 3. Procedure to generate $\epsilon$-$\delta$-Gaussian distribution

3.4 $\epsilon$-$\delta$-Gaussian Distribution

According to the definition of the RDP method based on the $\epsilon$-Gaussian
distribution, the probability that a random noise value equals $-\epsilon$ (or
$\epsilon$) could be close to (but less than) 0.5. In the extreme case where
$\sigma^2 = 0$, the probability is exactly 0.5. As a consequence, if the
tolerance level is known to a database attacker, he or she may simply use
either $x_k - \epsilon$ or $x_k + \epsilon$ to guess the original value
$x_k^*$, leading to revelation of the exact value $x_k^*$ with a probability
close to 0.5.

To deal with this problem, we consider a variant of the $\epsilon$-Gaussian
distribution, called the $\epsilon$-$\delta$-Gaussian distribution. We add one
more parameter $\delta$ in the generation of the $\epsilon$-$\delta$-Gaussian
distribution: a random noise value from a Gaussian distribution is first
generated; if the noise falls in the interval $[-\epsilon, 0]$ (or
$[0, \epsilon]$), it is linearly transformed to another interval
$[-\epsilon - \delta, -\epsilon]$ (or $[\epsilon, \epsilon + \delta]$), where
$\delta$ is a positive parameter. By doing so, the transformed noise is not
only “far away” from zero but also has zero probability of being equal to any
particular value. The CDF of the $\epsilon$-$\delta$-Gaussian distribution is
continuous (see Figure 1(c)). It is thus impossible to use $x_k - \epsilon$
(or $x_k + \epsilon$) to infer the exact value $x_k^*$.

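The $\epsilon$-$\delta$-Gaussian distribution can be generated by the procedure
shown in Figure 3, which is based on a Gaussian distribution with mean $\mu$
and variance $\sigma^2$. Let $\mu_{\epsilon\delta}$ denote the mean of the
$\epsilon$-$\delta$-Gaussian distribution and $\sigma_{\epsilon\delta}^2$ the
variance; we have

$$\mu_{\epsilon\delta} = \mu \qquad (7)$$

$$
\begin{aligned}
\sigma_{\epsilon\delta}^2
&= \sigma^2
 + \int_{\mu-\epsilon}^{\mu} \left( (\epsilon - \tfrac{\delta}{\epsilon}(t-\mu))^2 - (t-\mu)^2 \right) \cdot f(t)\, dt
 + \int_{\mu}^{\mu+\epsilon} \left( (\epsilon + \tfrac{\delta}{\epsilon}(t-\mu))^2 - (t-\mu)^2 \right) \cdot f(t)\, dt \\
&= \sigma^2 + 2\sqrt{\tfrac{2}{\pi}}\, \sigma\delta
 + \sqrt{\tfrac{2}{\pi}} \left( \epsilon\sigma - \tfrac{\sigma\delta^2}{\epsilon} - 2\sigma\delta \right) \cdot e^{-\frac{\epsilon^2}{2\sigma^2}}
 + \left( \epsilon^2 - \sigma^2 + \tfrac{\sigma^2\delta^2}{\epsilon^2} \right) \cdot \mathrm{erf}\left( \tfrac{\epsilon}{\sqrt{2}\sigma} \right) \qquad (8)
\end{aligned}
$$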

Clearly, the $\epsilon$-Gaussian distribution is a special case of the
$\epsilon$-$\delta$-Gaussian distribution where $\delta$ is zero (and the
Gaussian distribution is in turn a special case of the $\epsilon$-Gaussian
distribution where $\epsilon$ is zero). Previous discussions about the RDP
method based on the $\epsilon$-Gaussian distribution can be easily extended to
the RDP method based on the $\epsilon$-$\delta$-Gaussian distribution.
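
The linear transformation that distinguishes this variant can be sketched in Python. This is an illustration under stated assumptions rather than the authors' code: the function names, the `coin` argument used to resolve the zero-probability tie, and the sample parameters are hypothetical.

```python
import random


def eps_delta_transform(t, mu, eps, delta, coin=0.0):
    """Map one Gaussian draw t to noise that avoids (mu - eps, mu + eps)."""
    if t < mu - eps or t > mu + eps:
        return t                                   # outer draws are kept as they are
    if t == mu:                                    # zero-probability tie
        return mu - eps if coin < 0.5 else mu + eps
    shift = (delta / eps) * (t - mu)               # linear stretch by delta / eps
    return (mu - eps if t < mu else mu + eps) + shift


def eps_delta_gaussian(mu, sigma, eps, delta, rng=random):
    """Draw one noise value: Gaussian draw followed by the transformation."""
    return eps_delta_transform(rng.gauss(mu, sigma), mu, eps, delta, rng.random())


rng = random.Random(1)                             # fixed seed, for repeatability only
noise = [eps_delta_gaussian(0.0, 100.0, 50.0, 10.0, rng) for _ in range(10_000)]
```

Inner draws land in the bands of width `delta` just outside the tolerance interval instead of piling up on its two endpoints, which is what removes the guessing attack described above.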



4 Summary 

In this paper, we studied random data perturbation methods for preventing
interval-based inference in sum queries. Our study showed that traditional RDP
methods for preventing exact inference and statistical inference are not
effective in this case. To eliminate interval-based inference, we proposed a
new type of random distribution, the $\epsilon$-Gaussian distribution, to
generate random noise in data perturbation. We also analyzed the trade-off
among the magnitude of noise, query precision, and query set size in RDP.

Abstract. I present a traffic analysis based vulnerability in SafeWeb, an
encrypting web proxy. This vulnerability allows someone monitoring the
traffic of a SafeWeb user to determine if the user is visiting certain
websites. I also describe a successful implementation of the attack. Finally, I
discuss methods for improving the attack and for defending against the
attack.



1 Introduction 

Although encryption can hide the contents of data sent on the Internet, people 
often forget that encryption does not hide everything. Somebody eavesdropping 
on an encrypted conversation can still tell who is communicating and how much 
data is being transferred. For example, SafeWeb, an encrypting web proxy, is
vulnerable to an attack that can be carried out by simply looking at the amount
of encrypted data that is transferred. This vulnerability allows an
eavesdropper, such as a government, to tell if the people it is monitoring are
visiting certain websites. An attack exploiting this vulnerability has been
successfully implemented on a small network and can probably be used on an
entire country. However, there are several practical methods that can be used
to adequately protect against this vulnerability.

2 Definition of Traffic Analysis 

The process of monitoring the nature and behavior of traffic, rather than its
content, is known as traffic analysis. Traffic analysis usually works equally
well on encrypted traffic and on unencrypted traffic. This is because common
encryption methods, such as SSL, do not try to obfuscate the amount of data
being transmitted. Because of this, traffic analysis can usually reveal not only
who the receiver and sender of the data are, but also how much data was
transferred. In certain situations, an attacker's knowledge of the amount of
data transferred can have disastrous results.

3 SafeWeb 

SafeWeb is an encrypting web proxy. It attempts to protect its users both from
the websites they are accessing and from anyone who is monitoring their network
connection. It attempts to hide the identities of its users from the web
servers that they are accessing. It does this not only by making webpage
requests on behalf of its users, but also by rewriting potentially exposing
content. However, specially crafted JavaScript can be used by malicious
websites to circumvent this type of protection provided by SafeWeb.

R. Dingledine and P. Syverson (Eds.): PET 2002, LNCS 2482, pp. 2003.
© Springer-Verlag Berlin Heidelberg 2003

Andrew Hintz

SafeWeb also attempts to prevent someone who is monitoring the network of
a SafeWeb user from determining what the user is viewing. It uses a combination
of SSL and JavaScript to encrypt webpage content and the URLs for the pages
being viewed by the user. Because of this, an attacker monitoring the network
connection of an end user cannot determine the actual content or URLs that the
user is viewing. Although SSL adequately protects the content of the data, SSL
does not adequately guard against traffic analysis.



4 Fingerprinting Websites 

When a user visits a typical webpage, they download several files. A user down- 
loads the HTML file for the webpage, images included in the page, and the refer- 
enced stylesheets. For example, if a user visited CNN’s webpage at www.cnn.com, 
they would download over forty separate files. Each of these forty files has a spe- 
cific file size which is for the most part constant. 

When a user views a webpage using SafeWeb, they still download all of the 
files associated with the page. In a typical browser, such as Microsoft Internet 
Explorer 5, even when using SafeWeb, each file is returned via a separate TCP 
connection. Because each file is transferred in a separate TCP connection, each 
file is also returned on a separate port to the user’s computer, and it is quite easy 
for an attacker to determine the size of each file being returned to the SafeWeb 
user. All the attacker has to do is count the number of bytes that are being sent 
to each port on the SafeWeb user’s computer. 

If someone were to monitor the network of a SafeWeb user as they visited 
a website, the eavesdropper would be able to determine the number and the 
approximate size of the files that a user received. For example, the eavesdropper 
would know that the user created four connections that each received respectively 
10293 bytes, 384 bytes, 1029 bytes, and 9283 bytes. Each of these transfer sizes 
directly corresponds with the size of a certain file that was received by the user. 
The set of transfer sizes for a given webpage comprises that page’s fingerprint. 
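
As a rough illustration of this idea, the following Python sketch builds a fingerprint from observed traffic. It is not the program described later in the paper: the function name, the input format (pairs of local port and payload size), and the sample packet data are assumptions made for the example.

```python
from collections import Counter, defaultdict


def fingerprint(packets):
    """Build a fingerprint from (local_port, payload_bytes) pairs.

    Each local port stands for one TCP connection, and therefore one file.
    The result maps each transfer size to how often it occurred.
    """
    bytes_per_port = defaultdict(int)
    for port, nbytes in packets:
        bytes_per_port[port] += nbytes      # total bytes received on that port
    return Counter(bytes_per_port.values())


# Hypothetical capture: four connections, some split across several packets.
packets = [(1051, 1460), (1051, 1460), (1051, 7373),
           (1052, 384), (1053, 1029), (1054, 1460), (1054, 7823)]
fp = fingerprint(packets)
```

For this made-up capture the fingerprint contains the four transfer sizes 10293, 384, 1029, and 9283, each seen once.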

Webpages with a large number of graphics, such as the CNN webpage, have 
fingerprints that are composed of many different sizes. The more files in a given 
fingerprint, the larger the chance that the fingerprint will be unique. Let’s do a 
quick estimate of the number of different possible fingerprints. For mainstream 
sites that have a large amount of graphics, we can conservatively estimate that 
there are 20 different files in the page. Let’s say that each of these files has 
a random size between 500 bytes and 5000 bytes. This means that there are 
approximately 4500 different sizes that each of the 20 different files can be. 
Raising 4500 to the 20th power gives us that there are perhaps 10 to the 73rd 
different possible fingerprints. This number is much, much larger than the total
number of webpages currently on the world-wide-web. However, remember that
the 10 to the 73rd number only applies to websites that have approximately 20 
files associated with them. Websites that are purely HTML and do not reference 
any other files, such as graphics, would probably not have a unique fingerprint. 
This is because, using our previous estimate, there would be only about 4500 
different fingerprints for websites that are composed of only one file. There are 
certainly more than 4500 text-only webpages on the world-wide-web, so not all 
of the fingerprints for text-only webpages are unique. 
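
The size of this estimate is easy to verify. The short Python sketch below only redoes the arithmetic of the paragraph; the variable names are illustrative.

```python
import math

sizes_per_file = 5000 - 500        # about 4500 possible sizes per file
files_per_page = 20

fingerprints = sizes_per_file ** files_per_page   # exact integer
exponent = math.log10(fingerprints)               # order of magnitude
```

The exponent comes out just above 73, which agrees with the figure of 10 to the 73rd used above.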



5 The Real World Threat 

Several governments of the world consider the viewing of certain content on the
Internet to be illegal. For example, the Chinese government considers the viewing
of any dissident political ideas to be illegal. Because the Chinese government
controls all Internet connections into and out of the country, it can not only
monitor all Internet communication with computers outside of China, but also
block any outside sites. The Chinese government has blocked websites such as
CNN, the BBC, and the New York Times. A common use of SafeWeb is to
circumvent this blocking of websites. Because SafeWeb hides both the contents
and the URLs of the final site being visited, the Chinese government cannot
easily tell what website any SafeWeb user in China is viewing.

Using the previously mentioned file size fingerprinting system, the govern- 
ment could generate fingerprints for all illegal websites that it knows about. 
It could then watch all traffic for these fingerprints. Users whose traffic pat- 
terns sufficiently match the fingerprint for a banned website are then known to 
be viewing the banned websites. Although the government would have a huge 
amount of traffic to analyze, https traffic comprises only a very small portion of 
all Internet traffic. Also the government would have to periodically generate new 
fingerprints, because many blocked sites are news sites whose content changes 
frequently. 



6 Implementing a Fingerprinting Attack 

In order to test the feasibility of the fingerprint attack, I decided to actually 
implement the attack. In order to implement the attack, I first created a pro- 
gram that analyzes a tcpdump log and generates a fingerprint of the https 
traffic in the log. The program creates the fingerprint by calculating the to- 
tal amount of https data that is sent to the user on each of the user’s ports. 
The tcpdump log can be generated by any computer which can monitor the 
traffic being sent to the SafeWeb user’s computer. The implementation of the 
attack and a few example tcpdump log files are available on my website at
http://guh.nu/projects/ta/safeweb/

As an example, for a SafeWeb user visiting cnn.com on November 6, 2001,
the following fingerprint was generated:

size: 538 count: 1
size: 555 count: 2
size: 563 count: 1
... [34 lines of data have been removed] ...
size: 12848 count: 1
size: 18828 count: 1
size: 39159 count: 1
total number of different sizes: 40

Size is the amount of data in bytes received on a specific port and count is the 
number of times this specific data size was seen in the log file. 

The fingerprinting program can also determine how similar two different fin- 
gerprints are. It does this by counting the number of exact file size matches in 
two fingerprints. Here are the results when comparing the fingerprints of two 
different users visiting cnn.com a few hours apart: 

Number of connections in the file "cnn.com": 43
Number of connections in the file "cnn.com2": 42
Number of exact matches: 32

Here is the output comparing a user visiting cnn.com with a user visiting 
bbc.co.uk: 

Number of connections in the file "cnn.com": 43
Number of connections in the file "bbc.co.uk": 38
Number of exact matches: 2
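
A comparison of this kind could be sketched in Python as follows. This is an assumption-laden illustration, not the program used for the results above: the function names are hypothetical, matches are counted with multiplicity, the ratio is taken against the larger connection count, and the sample fingerprints are invented.

```python
from collections import Counter


def exact_matches(fp_a, fp_b):
    """Count transfer sizes that appear in both fingerprints (with multiplicity)."""
    return sum((fp_a & fp_b).values())


def match_ratio(fp_a, fp_b):
    """Matches divided by the larger connection count of the two fingerprints."""
    connections = max(sum(fp_a.values()), sum(fp_b.values()))
    return exact_matches(fp_a, fp_b) / connections if connections else 0.0


# Hypothetical fingerprints: size -> number of connections with that size.
visit_1 = Counter({538: 1, 555: 2, 563: 1, 12848: 1})
visit_2 = Counter({538: 1, 555: 1, 563: 1, 12901: 1})
other = Counter({410: 1, 563: 1, 2077: 2})

same_site = exact_matches(visit_1, visit_2)
different_site = exact_matches(visit_1, other)
```

In this toy data the two visits to the same page share three sizes, while the unrelated page shares only one, mirroring the gap between true and false matches reported in the text.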

Several pages were tested to verify that they could be easily fingerprinted. 
The pages I examined were cnn.com, bbc.co.uk, nytimes.com, slashdot.org, and 
washingtonpost.com. I visited each of these pages in Microsoft Internet Explorer 
5 using SafeWeb and kept a separate tcpdump log of each site I visited. About 
an hour later, I repeated the same process with a different computer viewing the 
same webpages. 

When I compared the fingerprints for sites that were the same, the smallest 
number of exact matches I found was 21. The smallest matches-to-connections 
ratio was 25 to 55. In other words, at least 45% of the connections were al- 
ways exact matches if the two fingerprints were of the same websites. However, 
typically 75% of the sizes were exact matches. 

These numbers only have value when compared to the number of false file
size matches. I compared each fingerprint of each site to each fingerprint of
each different site. The largest number of false file size matches that I got
when comparing fingerprints of different sites was 2. The largest percentage of
false file size matches that I got on any given comparison was 6%. However, for
sites that were different, usually only 0 or 1 matches occurred.

These initial results show that it is possible to maliciously fingerprint
webpages on a small scale. In order to determine the difficulty of
fingerprinting webpages on a very large scale, further tests need to be done.
Extensive, large scale tests have not yet been done because SafeWeb has shut
down its free web proxying service [2].

7 Improving the Attack 

Although the basic attack that has been described is sufficient for matching 
fingerprints on a small scale, the attack may need to be improved in order for it 
to work on a large scale. There are several things which can be done to improve 
the fingerprint attack against SafeWeb users. 

7.1 Analyzing the Order of Transmissions 

When a user visits a webpage, the first file they usually download is the HTML 
file for the webpage that they are visiting. The user’s browser then parses through 
the HTML file and requests any referenced files, such as graphics and stylesheets. 
Each particular browser usually requests referenced files for a particular webpage 
in the same order. An attacker could take this into account and evaluate not only 
the size of the transmissions, but also the order in which they occur. In an ideal 
situation, looking at the order of transmissions would increase the uniqueness 
of a page with twenty files by about twenty factorial, or about 2.4 times 10 to
the 18th power. However, in order to take maximum advantage of this improvement, an
attacker would have to generate several fingerprints for each website, because 
the attacker would need one for each different web browsing program. 

7.2 Improving Creation of the Fingerprint 

There are several things that can be done in order to improve the accuracy of 
the creation of the initial fingerprint. Because each fingerprint taken inherently 
includes some noise, it would be beneficial to use multiple sets of data in order 
to generate a more accurate fingerprint. One way this could be done is to take 
several fingerprints of the same website as viewed from different computers. All 
of these fingerprints could then be added together. Any file size which occurs 
some minimum number of times could then be considered to be an accurate file 
size and included in the fingerprint which is actually used for the attack. 
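
A minimal sketch of this aggregation step follows, assuming each sample fingerprint is simply a list of observed transfer sizes. The function name and the threshold parameter are hypothetical, and counting a size at most once per sample is a simplification chosen for the example.

```python
from collections import Counter

def aggregate_fingerprint(samples, min_occurrences):
    """Keep every size that appears in at least `min_occurrences` sample fingerprints."""
    seen_in = Counter()
    for sample in samples:
        seen_in.update(set(sample))  # count each size once per sample
    return sorted(size for size, n in seen_in.items() if n >= min_occurrences)

# Hypothetical samples of the same page taken from three computers.
samples = [
    [100, 200, 300],
    [100, 200, 999],   # 999 is noise seen only once
    [100, 555, 300],   # 555 is noise seen only once
]
combined = aggregate_fingerprint(samples, min_occurrences=2)
print(combined)
```

A higher threshold removes more noise but may also drop sizes that legitimately vary between visits.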

7.3 Expanding Fingerprints to Entire Websites 

Another idea for improving the accuracy and completeness of the attack is to 
expand the concept of a fingerprint from just one webpage to an entire website. 
When a user visits a website, they often visit several pages at the same
site. An attacker could take this into account by creating a fingerprint which 
contains all the file sizes for all files which are available from a certain website. 

For example, a fingerprint for the cnn.com website could include not only 
the files associated with the main page, but also all the files that are associated 
with any page linked to from the main page. One thing that should be noticed is
that most websites share common graphics and stylesheets among several pages
on the site. When a user visits multiple pages on the same website, the user’s
browser often caches the graphics that it has already downloaded and therefore
does not need to re-download all the graphics associated with each individual page.



7.4 Improving Matching 

One way to improve the matching of two fingerprints is to not require two file
sizes to be exactly the same in order to have a match. For example, it could be
assumed that if the sizes of two files are within 5 bytes of each other then they
are similar enough and are probably the same file. In order to see if this improves
the attack, I implemented this range matching method. The range matching
program first looks for matches that are exact. With the remaining sizes that
have not yet been matched, it looks for sizes that are 1 byte apart and matches
those that are. It continues this same process until it reaches the specified range.
However, in my test data, this range matching added roughly twice as many false
positives as it did correct matches.
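
The range matching procedure can be sketched as follows. This is an illustrative reimplementation under the description above, not the original program; the function name and the example sizes are hypothetical.

```python
def range_match(fingerprint_a, fingerprint_b, max_range):
    """Match sizes exactly first, then at distance 1, 2, ... up to max_range bytes."""
    remaining_a = sorted(fingerprint_a)
    remaining_b = sorted(fingerprint_b)
    matches = 0
    for distance in range(max_range + 1):
        unmatched_a = []
        for size in remaining_a:
            # Look for a still-unmatched size that is exactly `distance` bytes away.
            partner = next((s for s in remaining_b if abs(s - size) == distance), None)
            if partner is None:
                unmatched_a.append(size)
            else:
                remaining_b.remove(partner)
                matches += 1
        remaining_a = unmatched_a
    return matches

# Hypothetical sizes in bytes.
a = [1000, 2000, 3000]
b = [1000, 2003, 3010]
print(range_match(a, b, 0), range_match(a, b, 5), range_match(a, b, 10))
```

Because closer pairs are consumed first, widening the range can only add matches; whether the added matches are correct is exactly the false-positive question raised above.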



8 Protecting Against Fingerprinting 

There are several practical methods that can be used to help protect against 
website fingerprinting. Some of these must be implemented by the web proxy 
and some can be implemented by the end user. 



8.1 Adding Noise to Traffic 

One way of protecting against fingerprinting is for the proxy to add extra noise to
the data that it returns to the user. The methods of adding noise described here
require the proxy to modify the data before returning it to the user. Although
this may result in a performance hit, SafeWeb already modifies the HTML that
it returns by reformatting all the URLs on the webpage that it returns to the
user.



Modify Sizes of Connections. The web proxy could add extra randomly 
sized data to the files that it returns to the user. This extra data can be made 
so that it does not alter the appearance of the webpages that the user views. 
For example, a randomly sized comment could be added into every HTML file
just after the <html> tag at the beginning of the document. Many image formats,
such as JPEG, also allow variable-sized comments. The more random, extraneous
data that is added to each file, the more the true size of the files will be
obscured. However, adding extra data to transmissions will increase the amount
of bandwidth that is required. 
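
As a rough illustration of the HTML case, the sketch below inserts a randomly sized comment after the opening html tag. The function name, the padding limit, and the choice of filler characters are assumptions made for this example; a real proxy would also have to handle malformed documents and non-HTML content.

```python
import secrets
import string

def pad_html(html, max_padding=2048):
    """Insert a randomly sized HTML comment right after the opening <html> tag."""
    length = secrets.randbelow(max_padding + 1)
    filler = "".join(secrets.choice(string.ascii_letters) for _ in range(length))
    comment = "<!--" + filler + "-->"
    start = html.lower().find("<html")
    end = html.find(">", start) if start != -1 else -1
    if end == -1:
        return comment + html  # no <html> tag found: prepend the comment instead
    return html[:end + 1] + comment + html[end + 1:]

page = "<html><body>Hello</body></html>"
padded = pad_html(page, max_padding=64)
print(len(page), len(padded))
```

The padding limit bounds the bandwidth overhead per file; a larger limit hides the true size better at a higher cost.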

Add Extra Fake Connections. In order to lower the percentage of connections
which match a fingerprint, the web proxy could add in extra, randomly sized
connections. It could do this by inserting randomly sized 1 pixel by 1 pixel
transparent graphics into the HTML document. These extra graphics would be
most effective if added throughout the file, as opposed to putting them all at
the beginning or end of the file. This is because a large-scale fingerprint matcher
would probably look for clumps of matches, and adding extra connections to
the beginning or end of a transmission would not help to break up the clump of
matches. However, this method has several drawbacks. First of all, adding even
small graphics to some webpages could disrupt the intended layout of the page.
Also, adding a significant number of sizable, extra connections would require a
significant amount of extra bandwidth. For example, in order to cut the percentage
of matches in half, the proxy would have to approximately double the
amount of bandwidth that it uses.

8.2 Reduce Number of Files Transferred 

For a quick and easy solution, SafeWeb users could choose not to view graphics
on the webpages that they visit. This option is already available in most web
browsers. By choosing not to view graphics, a user would drastically decrease the
number of files received for most webpages. For example, if a user visited cnn.com
with graphics turned off, they would download less than 25% of the number of files
that they would have downloaded if they had viewed the site with graphics
turned on. The number of files transferred could be reduced even further by
disabling things such as stylesheets and ActiveX controls. This method would
leave each fingerprint with a very small number of file sizes, making it most
likely not unique. However, this would of course severely inconvenience most
users because they would not be able to view any graphics on the webpages that
they visit.

8.3 Transfer Everything in One Connection 

Another approach to protecting against fingerprinting is to make it difficult for 
an attacker to determine the size of each file being transferred by lumping all 
the files together. There are a few methods which might make this possible. One 
method is for a client to not open multiple connections to the same webserver.
There is a Windows registry setting for Microsoft Internet Explorer 5 which
sets the maximum number of simultaneous connections to any given webserver.
However, this setting does not seem to have any effect when browsing the web
using SafeWeb. Although this technique would make it more difficult for an
attacker to find the size of each file being transferred, it may still be possible to 
find the size of each file by looking at the timing of the packets transferred. 

Some servers have the capability to return a webpage and all associated files 
in a single tarball to the user. Using this method, it would be impossible for 
an attacker to determine the size of each individual file. However, not all
browsers and webservers have this capability.

9 Conclusion

Although SafeWeb is no longer open to the public, the ideas presented can be
applied to other encrypted web proxies. The fact that the size of data is often
not obfuscated by typical cryptography should also be kept in mind in
areas other than proxies. For example, there is a vulnerability in some versions
of SSH where an attacker watching a connection can determine the size of the
password being used. This is due to the fact that the size of the password is not
obfuscated.

Abstract. Several active attacks on user privacy in the World Wide
Web using cookies or active elements (Java, Javascript, ActiveX) are
known. One goal is to identify a user in consecutive Internet sessions in
order to track and profile him (such a profile can be extended by personal
information if available).

In this paper, a passive attack is presented that primarily uses information
of a different network layer. It is shown how expressive the data of the
HyperText Transfer Protocol (HTTP) can be with respect to identifying
computers (and therefore their users). An algorithm to reidentify computers
using dynamically assigned IP addresses with a certain degree of assurance
is introduced. Thereafter, simple countermeasures are demonstrated.

The motivation for this attack is to show the capability of passive privacy
attacks using Web server log files and to promote the use of anonymising
techniques for Web users.

Keywords: privacy, anonymity, user tracking and profiling, World Wide
Web, server logs



1 Introduction 

The need to protect user privacy on the Internet is generally hard for the
normal user to understand; their opinion is that “there is nothing to
hide”. The majority of average Internet users use the World Wide Web
(WWW for short) as their main Internet service without being conscious of the
fact that they give away plenty of information while browsing. This information
is given both implicitly and explicitly:

— Protocol information 

While communicating, the Web browser of the user and the Web server of a
content provider exchange data via the HyperText Transfer Protocol
(HTTP). Besides the information necessary for accessing a Web resource
(Web page, video clip, mp3 file, . . . ), additional information is transferred
to influence the reaction of the server (e.g. choice of the preferred resource
language, compression type, etc.). This data is transferred within the HTTP
headers (see Section 2) and can be very expressive.

— Personal information 

Using the Internet means searching for information (in most cases). From 
time to time, every user also gives away some of his personal data voluntarily 
(surveys, etc.). 

To trace users over a long time and consecutive Internet sessions, the IP 
address of the user’s computer is often used and therefore logged (besides some 
protocol information). Each IP address, identifying a computer uniquely on the 
Internet, can be used to reidentify a computer (and therefore its owner). 

Most Internet users do not have a dedicated Internet connection but dial in
via modem to their Internet Service Provider (ISP). In general, ISPs have
fewer IP addresses available than they have customers. Hence, a customer’s IP
address is assigned dynamically from a pool of available addresses at the time
of dialing in; this assignment is therefore quasi-random. The probability that
the same IP address is assigned to the same customer the next time he dials in
depends on the ISP’s number of customers and the size of the IP address pool.
Only the ISP is able to attribute an IP address to a customer after his Internet
session has ended.

It is often heard that the privacy of these users is sufficiently secured against
third parties interested in profiling.

In the following it is shown that this opinion is a misjudgement, and a passive
attack against the privacy of Web users with dynamically assigned IP addresses
is presented that is based only on protocol data that can be logged by every
common Web server.

The goal of this attack is to identify users with dynamically assigned IP 
addresses between different Internet sessions. 

The motivation for this attack is to show the necessity of the use of anonymis- 
ing techniques even for the normal Web user. 

2 The HyperText Transfer Protocol 

Web browser and Web server communicate using a standardised protocol,
the HyperText Transfer Protocol (HTTP). Each request for a Web resource
issued by a Web browser and addressed via a so-called Uniform Resource Locator
(URL) is answered with a response by the server. The server normally waits
for requests at port 80; the requests consist of normal ASCII text, can therefore
easily be read by humans, and are structured into fields. Besides the type of
access to a Web resource (GET, PUT, POST, . . .), the URL and an arbitrary
number of fields are transferred. Each field consists of a name and the
corresponding content. As an example, Table 1 shows the HTTP header fields of a
request for the resource at URL http://www.amazon.com/ using an Opera Web browser.

Table 1. HTTP header content for a request to http://www.amazon.com/

GET http://www.amazon.com/
Cache-Control: no-cache
Connection: Keep-Alive
Pragma: no-cache
Accept: text/html, image/png, image/jpeg, image/gif, image/x-xbitmap, */*
Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
Accept-Language: en
Accept-Charset: iso-8859-1, *, utf-8
Host: www.amazon.com
User-Agent: Opera/5.0 (Linux 2.2.16 i686; U) [en]



Table 2. Most often used HTTP header fields

HTTP/1.1 header fields:

Field name        Meaning
----------------  -------------------------------------------------------------
User-Agent        Contains information about the client’s configuration
From              Contains the client’s email address
Accept            The accepted media types (also an index to installed software)
Accept-Language   Specifies the languages the client is willing to accept
Accept-Encoding   The accepted encoding (compression type, etc.)
Accept-Charset    What character sets are acceptable for the response
Method            {OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, CONNECT}
Host              Contains the server’s hostname
Via               Can contain a specification from the proxy which was used to
                  transfer the request
If-*              A set of fields for conditional access to resources (with
                  reliance on the resource’s time of creation, modification, etc.)

Besides the information above, the type of Server-Protocol (HTTP/1.0 or
HTTP/1.1) is also transferred.



The meaning of the most important and most often used header fields is
explained and commented on in Table 2.

The content of the fields varies from browser to browser and depends on the 
browser’s underlying operating system, language preferences set by the user of 
a browser, software installed on the user’s computer, and many more individual 
conditions. 

Each request for a resource is normally logged by the Web server. Most Web 
servers are configured to do that by default and most of the server adminis- 
trators are not interested in changing this behaviour due to lack of interest or 
competence. Organisations caring about the privacy of Internet users like Pri- 
vacy Watch (and also federal privacy officers) criticise this behaviour. 

In general, the logged data consist of the URL, date, time, and the most often
used HTTP header fields. Extending the logged data to all HTTP header fields
requires only minor configuration of the Web server.

3 Problems of Passive Attacks

Passive attacks are based on analysing logged data. This data is very easily
available. Even if an attacker does not own a Web server but only uses the hosting
services of an ISP, an attack is easy to perform: Nearly all ISPs send the recorded
access data to their customers periodically. This log data contains not only
the accesses to the customer’s Web pages but in most cases also statistics about
time, geographical region, and domain of the Web pages’ visitors.

The focus of a passive attack is the IP address of a user’s computer. If this address
is static and therefore does not change over time, it is very simple to identify
and trace a person.

But the success of this method is restricted: The address space of the current
Internet Protocol is limited. In earlier years it was not unusual for a customer of
an ISP to get a static IP address, but more and more IP addresses are assigned
dynamically by ISPs. In most cases, the Internet Protocol Control Protocol
(IPCP), used by the Point-to-Point Protocol (PPP), handles the
process of dialing in at the provider; PPP can also perform the assignment of
IP addresses to the client. Each time a customer dials in, a currently unused
IP address is taken from the pool of addresses belonging to the provider; this
process is nearly pseudo-random.

Because some ISPs have more than a million customers (like AOL in the United
States or T-Online in Germany), the IP addresses available do not
belong to a contiguous range. Furthermore, even IP addresses assigned in two
consecutive Internet sessions do not have to match in any position.

From these explanations it follows that tracking such users by IP address
is futile.

4 Tracking Web Users by HTTP Headers 

The protocol information of another OSI layer could also be used to track users. 
Regarding the most often used Internet service, this would be HTTP. While IP 
addresses can change from session to session, some HTTP header fields have to 
stay unchanged to warrant the intended operation and communication between 
Web client and Web server. But even when relying on a certain set of mostly
invariant HTTP headers, slight modifications of browser configurations have to
be tolerated. This is important because users modify the browser settings
directly, and installing or removing software can also affect the HTTP headers.

A successful attack on the privacy of a Web user can be performed as follows:
Given a line of server log data (not necessarily containing IP addresses) that
describes one access to a Web page by a person using a Web browser, and a set of
such protocol lines possibly containing other accesses by this user, these accesses
can be identified with a certain degree of reliability. Each line of log data consists
of some meta information (date and time of the access) and the content of a set
of chosen HTTP header fields.

4.1 Terminology

The term Access Data Set (ADS) shall describe a set of

— a timestamp $ts$ describing date and time of access to a URL and

— a set of terms $t_{j,k}$, the contents of a selected number of HTTP
header fields $\{h_1, \ldots, h_m\}$. The number of terms a header field can contain
depends on the type of the header field.
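
To make the definition concrete, the following sketch represents an ADS as a timestamp plus a mapping from header field names to terms. The class and function names are hypothetical, the timestamp is a made-up value, and splitting field contents at commas is an assumption for illustration only; the exact parsing rules are not specified here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccessDataSet:
    timestamp: str   # date and time of the access
    terms: dict      # header field name -> tuple of terms

def make_ads(timestamp, headers, fields):
    """Build an ADS from the selected header fields of one logged request."""
    terms = {}
    for name in fields:
        raw = headers.get(name, "")
        # Assumed term separator: a comma, as in Accept-style list headers.
        terms[name] = tuple(p.strip() for p in raw.split(",") if p.strip())
    return AccessDataSet(timestamp, terms)

# Header values as they appear in an Opera request; the timestamp is hypothetical.
headers = {
    "Accept-Language": "en",
    "Accept-Charset": "iso-8859-1, *, utf-8",
    "User-Agent": "Opera/5.0 (Linux 2.2.16 i686; U) [en]",
}
ads = make_ads("2002-04-14T10:00:00", headers,
               ["User-Agent", "Accept-Language", "Accept-Charset"])
print(ads.terms["Accept-Charset"])
```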

An extended Access Data Set (eADS) extends an ADS by additional personal
information given by the corresponding user (e.g. entered into a Web form). In
practice, an attacker possessing an eADS (and a great number of accesses
to different URLs on different Web servers) wants to (re)identify a user in order
to connect this eADS to other (e)ADS. This gives him the opportunity not only to
track the user but also to profile him. Depending on the performance of the
computer used, even real-time tracking is possible.

An instance synonymously means a person using a Web browser to access 
Web pages and likewise this particular Web browser implicitly defined by its 
configuration. 

Instances generate (e)ADS while accessing Web pages. 



4.2 Identifying Sensitive HTTP Information 

The goal of the attack is to identify instances by their ADS (and therefore by
HTTP headers). Which HTTP fields are the most expressive and are worth
closer consideration?

Type 1: As mentioned, some header fields are intended for transporting instances
between server and client only and cannot be seen at the end points of
the communication.

Type 2: Other header fields can only contain a few different terms, like the
HTTP-Version field, which can only include “HTTP/1.0” or “HTTP/1.1”, or
the Method field, which can contain an entry of the set {OPTIONS, GET, HEAD,
POST, PUT, DELETE, TRACE, CONNECT}. An analysis of the ADS base used to
verify the attack showed that in more than 99% of the ADS only
an element of {GET, POST} is used.

In general, the more terms a header field can contain the more individual 
and expressive it can be: A header field that can span over up to p terms out of 
n possible terms with the restriction, that every term can be included at most 
once, can “mark” ADS individually. 

Type 3: A deeper analysis identifies a number of header fields that are
much more individual, such as User-Agent or Accept. These fields
remain largely invariant over a long time (at least until an instance changes its
operating system or the browser type).

Table 3 shows the results of an analysis of the HTTP header fields used in the
attack, based on the ADS set described in Section 4.4; the number of different
terms in each field is listed.

Table 3. Used HTTP headers and number of different terms in the ADS base

Field name        Term count    Field name            Term count
----------------  ----------    -------------------   ----------
Host                     459    Trailer                        1
User-Agent               318    Warning                        1
Server-Protocol            5    Via                         1248
Accept                    32    Range                         19
Accept-Language          159    If-Range                      17
Accept-Encoding            7    If-Match                       2
Accept-Charset             5    If-None-Match              21272
Method                     4    If-Modified-Since              1
Expect                     0    If-Unmodified-Since            2
From                      10    Sum                        23562



Definition: The frequency of a term is defined as the quotient of the number of
occurrences of this term and the number of all considered terms.

Even if a header field can contain one or more of a high number of terms, a
term’s frequency in a set of ADS determines the significance of this term and
the significance of an ADS containing this term. A field containing evenly
distributed terms is only as significant as a field of type 2.

ADS of the same instance vary over time: New software is installed, which
can have an impact on the Accept field; a proxy can be used (which modifies the
Via header and can extend the User-Agent field); or the language settings are
modified, which affects the Accept-Language field, etc.

To trace the Web page accesses of instances, correlations between different ADS
have to be determined. This situation can be equated with the problem
of finding documents in a document database. There, for a query formed of
keywords, the most relevant results shall be found. This problem has been solved
in the field of Information Retrieval.

4.3 Using Methods of Information Retrieval 

Information Retrieval is the field of searching for and finding information in large
data pools. Not every attribute or bit of information is relevant for searching;
because of redundancy in data objects, not the objects themselves but
representations of them are stored in a database, so the objects have to be filtered.
These representations must describe the data objects as well as possible. On the
other hand, queries to the database must be formed in the same way to fit the
representations; a transformation function is necessary for both requirements.

In our attack, the data objects are the ADS an adversary has collected; the
query is represented by an (e)ADS of an instance that shall be (re)identified and
tracked.

The filter is identical to the consideration of only the relevant HTTP header
fields as explained in Section 4.2. The transformation function is described in the
following as part of the complete algorithm.

4.4 Preparations

The results of our simulated attack are based on an ADS set of 2.7 billion Web
accesses. These are Web accesses to the same Web resource. In addition to the
HTTP header fields, date, time, and IP address of the accessing client have been
logged.



4.5 Algorithm 



Analysis of the ADS Base. The ADS base is analysed entry by entry, considering
only the relevant header fields. The content of each header field $h_j$ is parsed
to identify the field terms $t_{j,1}, t_{j,2}, \ldots$ For each header field an ordered
list of the different terms found is kept. If a new term is found, it is added to the
term list of the header field; each occurrence of a term already known increments
the counter $cnt(t_{j,k})$ of the corresponding entry in the list. The sum of all
terms of a header field $h_j$ is therefore $cnt(h_j) = \sum_k cnt(t_{j,k})$.

The function $l_j(a)$ is the number of terms in the header field $h_j$ of a certain
ADS $a$.

The significance of a term is identical to its inverse frequency in all data
objects, in our model in all ADS. Accordingly, the so-called term
weight is:

$$weight(t_{j,k}) = -\mathrm{ld}\left(\frac{cnt(t_{j,k})}{cnt(h_j)}\right)$$

The terms found are also called index terms because each ADS in the data
pool can be described as a combination of these terms.

The differences between the term weights of one example header field (here:
User-Agent) are displayed graphically in Figure 1. The impact of the term
weight is obvious: The more common a term, the lower its weight.
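
The counting and weighting step can be sketched as follows, assuming the term weight is the negative binary logarithm of a term's frequency within its header field. The function name and the sample values are hypothetical.

```python
import math
from collections import Counter

def term_weights(field_values):
    """field_values: one list of terms per ADS, all for the same header field."""
    counts = Counter(term for terms in field_values for term in terms)
    total = sum(counts.values())  # corresponds to cnt(h_j)
    # weight = -ld(cnt(term) / cnt(h_j)); rare terms receive a high weight
    return {term: -math.log2(n / total) for term, n in counts.items()}

# Hypothetical Accept-Language contents of three ADS.
values = [["en"], ["en"], ["en", "de"]]
weights = term_weights(values)
print(weights)
```

In this toy example the rare term "de" receives a weight of 2 bits, while the common term "en" receives well under 1 bit.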

In addition, another measure has to be defined to estimate the weight of an
ADS. It has to be considered that ADS can consist of different numbers of
terms. To avoid a concise ADS with only a few but “heavy” terms having the
same weight as a common ADS, the weight of an ADS $a$ is defined as follows:

$$weight'(a) = \sum_{j=1}^{n} \frac{\sum_{k=1}^{l_j(a)} weight(t_{j,k})}{l_j(a)}$$

with $n$ being the number of header fields of ADS $a$.
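
A small sketch of the ADS weight, read as the sum over all header fields of the average term weight in that field, is given here. The data layout (dictionaries keyed by field name), the function name, the example weights, and the decision to skip empty fields are assumptions for illustration.

```python
def ads_weight(ads, weights):
    """ads: field -> list of terms; weights: field -> {term: weight}."""
    total = 0.0
    for field, terms in ads.items():
        if not terms:
            continue  # assumed handling: an empty field contributes nothing
        # Average term weight of this field, i.e. sum of weights divided by l_j(a).
        total += sum(weights[field][t] for t in terms) / len(terms)
    return total

# Hypothetical term weights and one ADS.
weights = {
    "Accept-Language": {"en": 1.0, "de": 3.0},
    "User-Agent": {"Opera/5.0": 4.0},
}
ads = {"Accept-Language": ["en", "de"], "User-Agent": ["Opera/5.0"]}
print(ads_weight(ads, weights))
```

Averaging within each field keeps a field with many light terms from outweighing a field with a single heavy term.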



Construction of Index Vectors. As mentioned, the Information Retrieval
system stores a representation of the data objects (here: the ADS); therefore a
transformation function is necessary (as for the transformation of the ADS to
trace). Each ADS in the information system is represented by a binary index
vector for two reasons:

All users have been informed about the logging and its purpose for scientific research. 

Fig. 1. Term weights of header field User-Agent



1. More efficient storage 

Binary index vectors can be stored as sparse vectors. In our attack, the 
length of an index vector is 23623 where only 13 to 42 positions are set to 1. 

2. Calculation speed up 

The determination of the similarity of two ADS can be sped up while using 
sparse vectors. 

The length $l_{iv}$ of each index vector is equal to the sum of all terms of all
considered header fields. To build index vectors, the term lists are concatenated:
$(t_{1,1}, \ldots, t_{1,cnt(h_1)}, t_{2,1}, \ldots, t_{2,cnt(h_2)}, t_{3,1}, \ldots)$.
The term at position $j$ is called $t_j$.

An index vector of an ADS $a$ is described as follows:

$$iv(a) = (b_1, \ldots, b_{l_{iv}}), \quad b_j \in \{0, 1\}, \quad
b_j = \begin{cases} 1 & \text{if term } t_j \text{ is a member of } a \\ 0 & \text{else} \end{cases}$$

The value of index vector $iv(a)$ at position $i$ is denoted $iv_i(a) = b_i$.
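
The construction can be illustrated with a sparse representation that stores only the positions set to 1. The helper names and sample terms below are hypothetical; the positions follow the concatenation order of the per-field term lists.

```python
def build_term_index(term_lists):
    """term_lists: field -> ordered list of distinct terms. Returns (field, term) -> position."""
    index = {}
    for field, terms in term_lists.items():
        for term in terms:
            index[(field, term)] = len(index)  # next free position in the concatenation
    return index

def index_vector(ads, index):
    """Sparse binary index vector: the set of positions whose value is 1."""
    return {index[(field, t)]
            for field, terms in ads.items()
            for t in terms if (field, t) in index}

def dense(vector, length):
    """Expand a sparse vector to an explicit list of 0/1 values."""
    return [1 if i in vector else 0 for i in range(length)]

# Hypothetical term lists for two header fields.
term_lists = {"Accept-Language": ["en", "de"], "Accept-Charset": ["utf-8", "*"]}
index = build_term_index(term_lists)
iv = index_vector({"Accept-Language": ["de"], "Accept-Charset": ["utf-8"]}, index)
print(dense(iv, len(index)))
```

Storing a set of positions keeps memory proportional to the few dozen terms present in an ADS rather than to the full vector length.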



Search for Correlations. For a given probe ADS $a_{probe}$ of an instance to be
(re)identified and tracked, the similarity to each ADS of the ADS base has to
be determined. The similarity of two data objects is equal
to the scalar product of the corresponding index vectors considering the term
weight:

$$sim(a_{probe}, a_i) = \sum_{r=1}^{l_{iv}} \sum_{s=1}^{l_{iv}} iv_r(a_{probe}) \cdot iv_s(a_i) \cdot weight(t_r) \cdot weight(t_s)$$

In our attack, a modification of the formula is necessary: Differences in terms
lead to a decrease of the similarity value by the term’s weight (down to
a value of zero; further differences do not lead to negative values). A similarity
of 100% means identity of both ADS; in this case $a_{probe}$ and $a_i$ have the same
weight.

This similarity (to the probe ADS) is calculated for each element of the ADS
base.
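
One possible reading of the modified similarity is sketched here. It is an interpretation, not the formula used in the attack: it assumes that the similarity starts at 100% of the probe's summed term weights, that every term present in only one of the two ADS subtracts its weight, and that the result is clipped at zero. The function name, the flat per-term weights, and the sample values are hypothetical.

```python
def similarity(probe, candidate, weight):
    """probe, candidate: sets of term positions; weight: position -> term weight.

    Returns a percentage between 0 and 100 (assumed normalisation by the probe weight).
    """
    probe_weight = sum(weight[t] for t in probe)
    if probe_weight == 0:
        return 100.0 if probe == candidate else 0.0
    # Terms that appear in exactly one of the two ADS count as differences.
    penalty = sum(weight[t] for t in probe ^ candidate)
    return max(0.0, probe_weight - penalty) / probe_weight * 100.0

# Hypothetical term weights by position.
weight = {0: 1.0, 1: 2.0, 2: 3.0, 3: 4.0}
probe = {0, 1, 2}
print(similarity(probe, {0, 1, 2}, weight))
print(similarity(probe, {0, 1}, weight))
print(similarity(probe, {0, 1, 3}, weight))
```

Because the ADS are held as sets of positions, each comparison touches only the few terms actually present.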

Evaluation and Ranking. A dynamically assigned IP address issued to an
instance does not normally vary during an Internet session. For simplicity,
each instance is assumed to be online once a day.

A lower bound of similarity has to be considered. ADS found below that value
are disregarded; their similarity to the probe ADS is too low. This threshold
value is denoted by $\Delta_{sim}$ and is a relative value (ADS with similarity
below $100 - \Delta_{sim}$ are ignored). If an ADS $a_i$ is found whose similarity is above
the threshold, it initialises a potential activity period (PAP); ADS with a lower
similarity can be neglected. This tolerance is necessary to tolerate variations
in the ADS of instances over time (as explained in Section 4.2). A PAP
therefore forms a group of ADS assumed to be generated by the same instance
in the same online session. Each consecutive ADS $a_{found}$ found which fulfils the
following criteria is added to this PAP $P$:

— The IP address is the same as the IP address of the initialising ADS.

— The similarity to $a_{probe}$ is high enough (with regard to $\Delta_{sim}$):

$$sim(a_{probe}, a_{found}) \geq 100 - \Delta_{sim}$$

— The ADS lies within the time window given by the timestamp of the first
ADS and the window size $\Delta t$:

$$\forall a_i \in P: ts(a_i) \geq ts(a_1) \wedge ts(a_i) \leq ts(a_1) + \Delta t$$

A PAP therefore represents the potential occurrence of activity of an
instance. Each PAP found by the presented algorithm consists of the initialising
ADS and the most and the least similar ADS within the time window.

Any other ADS found which fulfils the similarity criterion but has a different
IP address or does not lie within the time window initialises a new PAP.
Two PAPs with different IP addresses which are not disjoint regarding the time
window build a PAP intersection.
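
The grouping into PAPs and the detection of PAP intersections can be sketched as follows. The class, the function names, the tuple layout of the input, and the sample values are hypothetical, and every qualifying ADS is kept as a member for simplicity.

```python
from dataclasses import dataclass, field

@dataclass
class PAP:
    ip: str
    start: float                      # timestamp of the initialising ADS
    members: list = field(default_factory=list)

def group_into_paps(hits, delta_sim, delta_t):
    """hits: (timestamp, ip, similarity in percent) tuples sorted by timestamp."""
    paps = []
    for ts, ip, sim in hits:
        if sim < 100 - delta_sim:
            continue  # below the similarity threshold: disregard
        for pap in paps:
            if pap.ip == ip and pap.start <= ts <= pap.start + delta_t:
                pap.members.append((ts, sim))
                break
        else:
            paps.append(PAP(ip, ts, [(ts, sim)]))  # initialises a new PAP
    return paps

def intersections(paps, delta_t):
    """Pairs of PAPs with different IP addresses whose time windows overlap."""
    return [(a, b) for i, a in enumerate(paps) for b in paps[i + 1:]
            if a.ip != b.ip
            and a.start <= b.start + delta_t and b.start <= a.start + delta_t]

# Hypothetical hits: timestamps in minutes, documentation IP addresses.
hits = [(0, "192.0.2.1", 100), (10, "192.0.2.1", 95), (20, "198.51.100.7", 97),
        (30, "192.0.2.1", 50), (200, "192.0.2.1", 99)]
paps = group_into_paps(hits, delta_sim=10, delta_t=100)
print(len(paps), len(intersections(paps, delta_t=100)))
```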

® This relation is not symmetric: Two ADS with the same weight do not have to be 
identical. 

^ An additional feature of the attack is the potential to determine if an instance 
uses statically or dynamically assigned IP addresses (it is no mandatory assumption 
of the attack that the given instance to be tracked uses dynamically assigned IP 
addresses). If all other PAPs found present the same IP address, one can conclude 
that this address is static. 

It is obvious that the grade of anonymity depends on the number of PAP
intersections of a probe ADS. The more intersections appear while calculating
the PAPs, the more anonymous the probe ADS is, because each (but at
most one) of the PAPs found can have been generated by the instance belonging
to the probe ADS with the same probability. This property is a direct analogy to the
mix concept developed by David Chaum [7], where each participant in a mix
network can be the sender of a message with the same probability.

The more common the configuration of an instance, the more common are
the ADS generated by it. Because of this property, there are also many
instances that generate such an ADS, and therefore the probability of PAP
intersections is very high for common instances or, more precisely, their
configurations. The PAP intersections of a probe ADS form an anonymity set for
the probe.

Furthermore, the following restriction is given: 

An attacker cannot be sure that a PAP found is generated by the tracked
instance at all, because the combination of the terms in the header fields is not
unique. Additionally, an attacker does not know the precision of the retrieval,
because he cannot distinguish between PAPs correctly identified and PAPs
identified but not belonging to the traced instance.

4.6 Quality of Found PAPs 

Anonymity is the state of being not identifiable within a set of subjects, the 
anonymity set jH]. In the present scenario of (re)identifying instances this set is 
the PAP found. Being anonymous, all members of this set can represent with 
the same probability the probe ADS. With the explained procedure some of the 
ADS out of the set are more likely identical to the probe ADS (depending on the 
determined similarity). Considering the restriction for eliminating consecutive
appearances of instances (Δt) and the maximum allowed relative difference to
the probe ADS (Δsim), one obtains the number of relevant PAP.

The grade of anonymity is equivalent to the number p of PAP intersections:
With respect to the definition of the anonymity of a group of senders |S|, the
anonymity of the probe ADS is denoted as ld(p) (the binary logarithm of the
number of intersections).
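
As a small illustration of this measure, the following sketch computes ld(p)
for a given number of PAP intersections. The function name and the requirement
p >= 1 are assumptions made for the example; they are not part of the paper.

```python
import math


def anonymity_degree(pap_intersections: int) -> float:
    """Return ld(p), the binary logarithm of the number p of PAP intersections.

    Assumption for this sketch: p counts the candidates among which the probe
    ADS is hidden, so p must be at least 1 (p = 1 means no anonymity).
    """
    if pap_intersections < 1:
        raise ValueError("p must be at least 1")
    return math.log2(pap_intersections)


if __name__ == "__main__":
    for p in (1, 2, 8):
        print(p, anonymity_degree(p))
```

Doubling the number of intersections raises the measure by exactly one bit.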

The number of PAP intersections itself depends on several factors: 

— The time window Δt

The higher this value, the more intersections will be found. Because most
users on the Internet are connected via dial-in gateways (and therefore receive
dynamically assigned IP addresses), they are online once a day. Thus, six to
24 hours is an appropriate value for Δt.

— The similarity threshold Δsim

Variations of several (hardware and software) configurations over some time
are analysed. Evaluations have shown that the ADS weight varies up to
10.30% with respect to modifications mentioned in Section 4.2.

The restriction mentioned above limits the assertion that a found PAP, even
if it is a member of an anonymity set of size one, is generated by the tracked
instance. The attacker only knows the weight of an ADS, the number of PAP
found, and the number of PAP intersections. Therefore he needs to use some
empirical data.



4.7 Experiments and Results 



To verify the correct function of the presented algorithm and to ascertain a 
measure for the quality of the found PAPs (and thus for the grade of anonymity), 
a modified ADS set has been created: 

Nearly 300 ADS have been randomly created to build the probe set. Then these 
probe ADS have been “mutated” in a way a real instance would generate varying
ADS (see Section 4.2), resulting in more than 13,000 probe ADS. Additionally,
these mutated ADS have been marked so that they could be reidentified for 
verification, but without influencing the retrieval algorithm. The mutated ADS 
have been spread over the ADS base resulting in a modified ADS base. 

In a second step the original probe set has been tested against the modified ADS 
base. 

This way, the exact number of ADS to be found for each probe ADS is known,
which allows the quality of the complete attack to be judged.

Two rates determine the accuracy of the retrieval (and therefore of the at- 
tack) : 

— Recall: a measure of how well the attack performs in finding relevant ADS. 
The recall rate is 100% when every relevant ADS is retrieved. 

— Precision: a measure of how well the attack performs in not returning nonrel- 
evant ADS. The precision rate is 100% when every ADS returned is relevant 
to the probe ADS. 

With NRA (Number of relevant ADS), NAF (Number of ADS found), NIF 
(Number of irrelevant ADS found), and NRF (Number of relevant ADS found) 
the rates above are defined as follows: 



$$\text{recall} = \frac{NRF}{NRA}$$

$$\text{precision} = \frac{NRF}{NAF} = \frac{NRF}{NRF + NIF}$$
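
The two rates can be computed directly from the counts defined above. The
sketch below is only an illustration; the function name and the example counts
are invented, and NAF is taken to be NRF + NIF as the definitions imply.

```python
def recall_precision(nra: int, nrf: int, nif: int) -> tuple[float, float]:
    """Compute (recall, precision) from the counts used in the text.

    nra: number of relevant ADS in the modified ADS base
    nrf: number of relevant ADS found
    nif: number of irrelevant ADS found
    """
    naf = nrf + nif  # number of ADS found in total
    recall = nrf / nra if nra else 0.0
    precision = nrf / naf if naf else 0.0
    return recall, precision


if __name__ == "__main__":
    # Invented counts: 50 relevant ADS, 49 of them found, 20 false hits.
    print(recall_precision(nra=50, nrf=49, nif=20))
```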

Figure 2 shows the average recall and precision rates of the test set with
the probe ADS for Δsim ∈ {0, 5, 10, ..., 100}%. The attack is able to identify
with an average recall rate up to 0.98 and with a precision rate up to 0.71. One
can identify a local optimum at Δsim = 35%. Attention has to be paid to a high
value of precision, because every found PAP has to be stored and tracked
step-by-step; PAP found with a precision lower than a given bound can be
discarded. Therefore precision needs to be as high as possible.

Fig. 2. Recall, precision, and correlation coefficients
(plotted series: recall, precision, rho(PAP intersections, precision),
rho(weight', precision))

An attacker needs to know how expressive the PAP he found are, knowing
the PAP intersections count and the weight. Additionally, it is of interest to
him whether instance tracking is meaningful at all. He can detect this using
empirical data such as the correlation between the PAP intersections and the
precision to be expected, and between the ADS weight and the precision,
respectively. The two graphs of the corresponding correlation coefficients are
also plotted in the figure (see ρ(PAP intersections, precision) and
ρ(weight', precision)). There is a clear correlation between these properties:

— The higher the weight of an ADS the higher the precision of the retrieval. 
This validates the assumption that some ADS (and the generating instances) 
are more identifiable than others. ADS with low weight can be discarded from 
tracking. 

— The higher the PAP intersection the lower the precision. 

This is a result of the greater anonymity set for such ADS/instances. 

To visualise these results, two ADS are randomly selected, one with a high 
weight (156.01) and one with a low weight (87.58): 

Table 4. ADS with weight 87.58 and retrieval precision 0.82

<DATE> <TIME> <HOST>.dip.t-dialin.net <IP> Mozilla/4.0 (compatible; MSIE 5.5;
Windows 98; Win 9x 4.90) HTTP/1.0 GET */*

Table 5. ADS with weight 156.01 and retrieval precision 0.97

<DATE> <TIME> <HOST>.uni-hamburg.de <IP> Mozilla/4.76 [de]
(X11; U; Linux 2.2.10 i686) HTTP/1.0 GET image/gif, image/x-xbitmap,
image/jpeg, image/pjpeg, image/png iso-8859-1,*,utf-8 gzip de, es-MX, es, en

There is an obvious difference in the precision of retrieval of the two ADS.
This observation becomes clear by analysing the two ADS: Mozilla 4.0, here an
alias for the Microsoft Internet Explorer, is a very popular Web browser used by
most Web surfers. The presented operating system, Windows 98, is likewise used
by many (non-expert) users. Each instance using this configuration can be sure
to appear with an often used configuration (resulting in a low retrieval
precision) and to be hidden in a large anonymity set (leading to a high number
of intersections). It should further be noted that this ADS does not give any
additional information such as language or media type preferences.

Note: due to technical limitations, ρ is indicated as “rho” in Fig. 2.

The other instance is using an operating system for professionals. This results
in a high term weight in the User-Agent field. While the media type field
(Accept) is not very unusual, the Accept-Language header contains a rare
term, es-mx, standing for “Spanish-Mexican”. Such a configuration can be
reidentified easily.

5 Countermeasures 

It has been shown that instances are easier to trace if the variations in the
ADS they generate are low (with respect to the selected HTTP fields and the
value for Δsim). Considering the number of PAP intersections as a measure of
anonymity of an instance, one can easily identify two countermeasures:
increasing the anonymity set and decreasing the relative similarity to the
probe ADS. Anonymising proxies are a suitable solution for both.

5.1 Proxies 

A proxy is a service acting as an intermediary between a Web client and a 
Web server. Requests from the client are treated in the desired way and then 
forwarded to the server. Two kinds of proxies are applicable to reach or enlarge 
the anonymity of an instance and therefore to decrease the degree of assurance 
of found PAP for a probed ADS. 

Anonymising Web Proxies. Services for the anonymising of Web users are 
already existent on the Internet. The best known are Anonymizer HU] and Reweb- 
ber j I I il ‘2\ . Users can access Web pages directly via entering the intended URL 
in a Web form at each service or by configuring the service as their default proxy 
in the Web browser. The proxy acts as a middleman for the user. Both services 
anonymise the HTTP header by replacing the relevant HTTP header content with
their own text. This way an adversary logs a large number of identical-looking
ADS from such a service instead of a number of different ADS from the instances.

Today anonymising proxies are not widely used; therefore they cannot
be seen as a standard part of the Internet infrastructure.






Thomas Demuth 



5.2 Local Proxies 

An approach to decrease the similarity is to vary header fields in each access to
a Web page, leading to a very different sum of term weights in each access. This
is only possible in a few header fields because many fields determine the
reaction of the server in an essential way (Accept-Language determines the
language of a Web page sent back by the Web server, etc.). As an alternative,
the existing header content can be extended by valid terms. In the case of
Accept-Language (and most other HTTP header fields) this is possible, because
header terms are considered with priority from left to right. Added extensions
can vary the header field (and change the similarity to the probe ADS) while the
intended settings of a field are still interpreted correctly.
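
The header extension described above could be sketched as follows. This is a
minimal illustration, not the author's implementation: the pool of filler
language tags and the function name are hypothetical, and the sketch relies on
the text's assumption that terms are evaluated with priority from left to
right, so that appended terms do not change the user's intended preference.

```python
import random

# Hypothetical pool of valid language tags used only as padding.
FILLER_LANGUAGES = ["fr", "it", "nl", "sv", "pt", "da", "fi", "pl"]


def extend_accept_language(header: str, extra: int, rng: random.Random) -> str:
    """Append up to `extra` randomly chosen valid terms to an Accept-Language value.

    The user's own terms stay at the front in their original order; only the
    tail of the header varies from request to request.
    """
    terms = [term.strip() for term in header.split(",") if term.strip()]
    present = {term.split(";")[0].lower() for term in terms}
    candidates = [lang for lang in FILLER_LANGUAGES if lang not in present]
    count = max(0, min(extra, len(candidates)))
    additions = rng.sample(candidates, count)
    return ", ".join(terms + additions)


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(3):
        print(extend_accept_language("de, en", 3, rng))
```

Each request then carries a different tail of terms, which changes the sum of
term weights while the leading terms chosen by the user are preserved.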

The described functionality can be performed by a simple piece of software
running locally on a user’s computer.

6 Conclusion 

It has been shown that active attacks on a Web user’s privacy are obvious and
conspicuous but also easy to defend against by means of anonymising techniques.

The proposed passive attack uses access data that can easily be created. It 
shows a way and procedure how to measure a Web access by an instance, how 
to compare a probe ADS to other ADS, and how to track instances. 

The quality of the results can be measured with the shown degree of assurance. 

Furthermore, the attack can be applied in real-time using common computer 
hardware. 

The attack shows the vulnerability of the privacy of Web users and motivates 
the use of anonymising techniques. Two simple countermeasures are proposed 
which are even applicable for the normal Web user. 
Abstract. Covert channels exist in most communications systems and
allow individuals to communicate truly undetectably. However, covert
channels are seldom used due to their complexity. A protocol for sending 
data over a common class of low-bandwidth covert channels has been de- 
veloped. The protocol is secure against attack by powerful adversaries. 
The design of a practical system implementing the protocol on a stan- 
dard platform (Linux) exploiting a channel in a common communications 
system (TCP timestamps) is presented. A partial implementation of this 
system has been accomplished. 



1 Introduction 

A covert channel is a communications channel which allows information to be 
transferred in a way that violates a security policy. Covert channels are important
methods of censorship resistance. An effective covert channel is undetectable by 
the adversary and provides a strong degree of privacy. Often the fact that secret 
communication is taking place between parties is undesirably revealing. 

Consider the prisoners’ problem, first formulated by Simmons [Sim84]. Alice
and Bob are in prison attempting to plan an escape. They are allowed to commu- 
nicate, but a Warden watches all of their communications. If the Warden notices 
that they are planning to escape or even suspects them of trying to communicate 
secretly, they will be placed in solitary confinement. 

The prisoners’ problem is theoretically interesting and provides a good
explanation of the problem that covert channels solve; it is also increasingly
relevant in real world situations. Many governments place restrictions on the
use of cryptography on their systems. The situation is particularly extreme in
China, where all ISPs are subject to government control, although electronic
systems are increasingly subject to surveillance in all parts of the world. Private
companies increasingly monitor and censor communications with firewalls.

Covert channels require many apparently contradictory properties.

• Plausibility - the adversary must believe that it is not only possible, but 
likely, that the user of the covert channel is using the medium in which the 
covert channel is found without sending covert data. 

• Undetectability - In order for the channel to be undetectable, the distribution
of bits sent must conform to the distribution of the channel itself; otherwise
the channel will be automatically detectable via statistical tests.

• Indispensability - The channel must be something which an adversary cannot
or will not close off, usually because it serves some useful function
to the users who are not sending covert data.

TCP timestamps are useful for this purpose because they satisfy these prop- 
erties to a high degree. 

• TCP is a ubiquitous protocol used all over the world for the majority of all
Internet traffic. It is almost always employed using the timestamp option. 
As a result, it is extremely plausible that the majority of users using TCP 
will not be sending covert data and have a good reason for using it. 

• On a slow connection, the low order bits of timestamps appear random. As 
a result, statistical analysis of the headers of the packet will be ineffective in 
detecting the channel. 

• Since TCP is so widely used, it is likely that attempts to modify TCP packets 
on a large scale would inconvenience mainstream users. 

The channel is broken if it is detected by the adversary. This situation can
be modeled as follows: For any two TCP connections t1 and t2 on the same
network, an adversary cannot distinguish which of the connections is using the
covert channel with probability greater than 1/2. The channel is also broken if
the adversary can deny service to the users of the covert channel while still
allowing service to the users of unmodified TCP connections.

The adversary has the power to monitor all the traffic on the network and 
to modify some traffic. The adversary only views network traffic and cannot 
see the processing of packets on sender and receiver machines. We consider an 
eavesdropper who has the power to modify any number of packets, but cannot 
constantly modify packets sent by a single host. 

In the prisoners’ analogy, the warden would not be constantly changing every 
message which passes between prisoners, but might occasionally modify some 
messages in the hopes of detecting covert communication. The warden lacks the 
resources to modify every message sent between every prisoner all of the time. 
If the warden was sufficiently suspicious of one prisoner to modify all of their 
messages, he would just put that prisoner in solitary and be done with it. 

It is worth noting that if a more powerful adversary than this is willing
and capable of either preventing users from using the timestamp option with 
TCP or overwriting the low order bits of TCP timestamps of every packet, then 
the adversary will have closed the channel. We assume that the adversary is 
either unwilling to do this, unable to do this, or will be annoyed by being forced to 
do this. In addition, we believe that even if this channel is closed, the techniques 
presented in this paper will be useful in providing reliable communication over 
other low bandwidth covert channels. It is also useful to realize that even if the 
adversary denies service to the channel, he still cannot detect whether covert 
data was being sent regardless of how much data he modifies or snipes. 

Most of the interesting work which we have done deals with the problem of
sending a message at a rate of one bit per packet over an unreliable channel, and
we believe that even if this particular channel is closed the work we have done
will be relevant to other similar channels that may be identified.

2 Related Work 

Many other channels have been identified in TCP. These include initial
sequence numbers, acknowledged sequence numbers, windowing bits and protocol
identification [Row96] [UCD99]. These papers focus on finding places where
covert data could potentially be sent but do not work out the details of how to
send it. Those implementations which exist [Row96] generally place into header
fields values that are incorrect, unreasonable or even outside the specification.
As long as the adversary is not looking, this may be effective, but it will not
stand up to concerted attack, being effectively security through obscurity.
These systems cannot withstand statistical analysis.

The TCP protocol is described in RFC 793 [Pos81]. A security analysis of
TCP/IP can be found in [Bel89]. We are certainly not the first group of people
to identify the possibility of using the TCP/IP Protocol Suite for the purposes
of transmitting covert data. In “Covert Channels in the TCP/IP Protocol
Suite” [Row96], Craig Rowland describes the possibility of passing covert data
in the IP identification field, the initial sequence number field, and the TCP
acknowledgment sequence number field. He wrote a simple proof-of-concept,
raw-socket implementation, covert_tcp.c. The possibility of hiding data in
timestamps is not discussed. We feel that embedding data in the channels
identified here would not be sufficient to hide data from an adversary who
suspected that data might be hidden in the TCP stream.

In “IP Checksum Covert Channels and Selected Hash Collision .’’ [Aba.ni] the 
idea of using internet protocol checksums for covert communication is discussed. 
Techniques for detecting covert channels, as well as possible places to hide data 
in the TCP stream, are discussed (the sequence numbers, duplicate packets, 
TCP window size and the urgent pointer) in the meeting notes of the UC Davis
Denial of Service (DOS) Project [UCD99].

The idea of using timing information for covert channels (in hardware) is 
described in “Countermeasures and Tradeoffs for a Class of Covert Timing 
Channels. ”im] More generalized use of timing channels for sending covert infor- 
mation is described in “Simple Timing Channels.” jIVI IVI94j 

Covert channels are discussed more generally in a variety of papers. A gen- 
eralized survey of information-hiding techniques is described in “Information 
Hiding - A Survey. ”[ESEnn! Theoretical issues in information hiding are consid- 
ered in lOiznH] and HEHEI. John McHugh provides a wealth of information on 
analyzing a system for covert channels in “Covert Channel Analysis” [McH95].
The subject is addressed mainly in terms of classified systems. These sorts of
channels are also analyzed in “Covert Channels - Here to Stay?” [MK94]. These
papers focus on the prevention of covert channels in system design and detecting
those that already exist, rather than exploiting them. G.J. Simmons has done
a great deal of research into subliminal channels [Sim84] [Sim94] [Sim93] [Sim98].
He was the first to formulate the problem of covert communication in terms of
the prisoners’ problem, did substantial work on the history of subliminal
communication, in particular in relation to compliance with the SALT treaty,
and identified a covert channel in the DSA.

3 Design 

The goal of this system is to covertly send data from one host to another host. 
There are two important parts to this goal. First, we must send data. Second, 
we must be covert (i.e. only do things that our adversary could not detect). 

It is important to note that these two goals are at odds with each other. 
In order to send data, we must do things that the receiving host can detect. 
However, we must not do anything that an eavesdropper can detect. 

We approach this problem by presuming the existence of a covert channel 
that meets as few requirements as possible. We then describe a protocol to use 
such a channel to send data. Finally, we identify a covert channel that meets the 
requirements that we have proposed. 

3.1 Characteristics of the Channel 

In designing our covert channel protocol, we seek to identify the minimum re- 
quirements for a channel which would allow us to send useful data. 

In the worst case scenario, the channel would be bitwise lossy, unacknowl- 
edged, and the bits sent would be required to pass certain statistical tests. By 
bitwise lossy, we mean the channel can drop and reorder individual bits. By un- 
acknowledged, we mean that the sender does not know what bits, if any, were 
dropped and does not know what order the bits arrived in. 

Using this channel to send data is extremely difficult. However, if we relax 
these restrictions in reasonable ways, the problem becomes clearly tractable. 

For simplicity, we will assume that the only statistical test that the bits must 
pass is one of randomness, since this will be convenient for embedding encrypted 
data. This is reasonable since it is not prohibitively difficult to identify covert 
channels that normally (i.e. when they are not being used to send covert data) 
contain an equal distribution of ones and zeros. 

We will also assume that each bit has a nonce attached to it and that if the 
bit is delivered, it arrives with its nonce intact. This condition is both sufficient 
to make the channel usable to send data and likely to be met by many covert 
channels in network protocols. The reason why it is an easy condition to meet is 
that most covert channels in network protocols involve embedding one or more 
bits of covert data in a packet of innocuous data. Thus, the innocuous data (or 
some portion thereof) can serve as the nonce. 

3.2 Assumptions 

We presume that we have a channel with the above characteristics. We further 
presume that the adversary cannot detect our use of that channel. Lastly, we 
presume a shared secret exists between the sender and receiver. 







John Giffin et al. 



The first two presumptions will be justified in sections El and O respec- 
tively. The third presumption is justified on the grounds that it is impossible to 
solve the problem without it. This is the case because if the sender and receiver 
did not have a shared secret, there would be nothing to distinguish the receiver 
from the adversary. Any message that the sender could produce that was de- 
tectable by the receiver could be detected by the adversary in the same manner. 
Note that public key cryptography is no help here, because any key negotiation 
protocol would still require sending a message to the receiver that anyone could 
detect. 

We also assume that it is sufficient to implement a best effort datagram
service, such as that provided (non-covertly) by the Internet Protocol. In such 
a service, packets of data are delivered with high probability. The packets may 
still be dropped or reordered but, if a packet reaches its destination, all the 
bits in the packet reach the destination and the order of the bits within the 
packet is preserved. This level of service is sufficient because the techniques to 
implement reliability over unreliable datagrams are well understood, and in some 
applications reliability may not be required. 

We now present a method to implement best effort datagrams over a channel 
with the above characteristics. 



3.3 Protocol 



In order to send messages over this channel, we send one bit of our message block
M per bit of the channel, rather than sending some function of multiple bits.
This way, each bit of the data is independent and if one bit is lost or reordered
it will not affect the sending of any of the other bits. We choose which bit of
the message block to send based on a keyed hash of the nonce. That is, for a
message block of size l and a key K, on the packet with nonce t we send bit
number n where

$$n = H(t, K) \pmod{l} \qquad (1)$$
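
The selection rule of Equation (1) can be sketched in a few lines. The paper
does not name a concrete hash function here, so HMAC-SHA256 is used purely as
an assumed stand-in for the keyed hash H, and the function names are
hypothetical.

```python
import hashlib
import hmac


def keyed_hash(nonce: bytes, key: bytes) -> int:
    """Assumed instantiation of H(t, K): HMAC-SHA256 read as a big integer."""
    digest = hmac.new(key, nonce, hashlib.sha256).digest()
    return int.from_bytes(digest, "big")


def bit_index(nonce: bytes, key: bytes, block_len: int) -> int:
    """Return n = H(t, K) mod l, the index of the message bit to send."""
    return keyed_hash(nonce, key) % block_len


if __name__ == "__main__":
    key = b"shared secret"
    for nonce in (b"packet-header-1", b"packet-header-2"):
        print(nonce, bit_index(nonce, key, 256))
```

Because sender and receiver share the key and both see the nonce, each side
computes the same index without any additional signalling.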



The hash function H should be a cryptographic hash function which is
collision-free and one-way. Because the nonce t will vary with time, which bit
we send will be randomly distributed over the l bits in the block. We can keep
track of which bits have been sent in the past, in order to know when we have
sent all the bits. The expected number of channel bits x it takes to send the l
bits of the block will be

$$x = \sum_{i=0}^{l-1} \frac{l}{l - i} \qquad (2)$$
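
If each channel bit selects one of the l message bits uniformly at random,
the number of channel bits needed to cover every message bit at least once is
an instance of the classic coupon collector problem. The sketch below
evaluates that expectation exactly; it assumes a lossless channel and a
uniform choice, and the function name is invented for the example.

```python
from fractions import Fraction


def expected_channel_bits(block_len: int) -> Fraction:
    """Expected channel bits needed to send every one of block_len bits once.

    After i distinct bits have been sent, a new one is hit with probability
    (l - i) / l, so the expected wait for it is l / (l - i).
    """
    return sum(
        (Fraction(block_len, block_len - i) for i in range(block_len)),
        Fraction(0),
    )


if __name__ == "__main__":
    for l in (2, 8, 256):
        print(l, float(expected_channel_bits(l)))
```

For a block of two bits the expectation is three channel bits; the value grows
roughly like l times the natural logarithm of l.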



Of course, because our channel loses bits, this is not sufficient. We thus
send each bit more than once, calling the number of times we send each bit
the occupation number of that bit, o. The probability of our message getting
through, p, will be based on the probability that a bit is dropped, d, and the
occupation number o. The probability will be bounded below by $(1 - d^o)^l$.
Thus for any drop rate, we can choose a sufficiently high occupation number to
assure that our messages will get through. And for small drop rates the
occupation number does not need to be large for the probability of successful
transmission to be high.

When sending each bit, it must have the same statistical properties as the
covert channel has when not being used or else an adversary could use statistical
analysis to detect the use of the channel. As we mentioned above, we assume
that the channel is normally random. Thus, our bits must appear random. Since
much research has been done in finding cryptographic means to make ciphertexts
indistinguishable from random distributions, this will be easy. We accomplish
this as follows. We derive a key bit k from the same keyed hash of the nonce t
as in Equation 1, making sure to not correlate n and k:

$$k = \begin{cases} 1 & \text{if } \lfloor H(t, K) / l \rfloor \equiv 0 \pmod{2} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

The transmitted bit b is the exclusive or of the key bit k and the plaintext
message bit M_n. Because k seems random, b will seem random, and thus the
random characteristic of our channel is preserved.
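
The masking step can be illustrated with a short round trip between sender and
receiver. This is a sketch under stated assumptions: HMAC-SHA256 stands in for
the keyed hash H, and the key bit is taken from the parity of the integer
quotient of the hash by the block length, which is one way to keep it
uncorrelated with the index n. All function names are hypothetical.

```python
import hashlib
import hmac


def keyed_hash(nonce: bytes, key: bytes) -> int:
    """Assumed instantiation of H(t, K): HMAC-SHA256 read as a big integer."""
    return int.from_bytes(hmac.new(key, nonce, hashlib.sha256).digest(), "big")


def index_and_key_bit(nonce: bytes, key: bytes, block_len: int) -> tuple[int, int]:
    """Derive the bit index n and an (assumed) uncorrelated key bit k."""
    h = keyed_hash(nonce, key)
    n = h % block_len
    k = 1 if (h // block_len) % 2 == 0 else 0
    return n, k


def encode_bit(message_bits: list[int], nonce: bytes, key: bytes) -> int:
    """Sender side: return the channel bit b = k XOR M[n] for this nonce."""
    n, k = index_and_key_bit(nonce, key, len(message_bits))
    return k ^ message_bits[n]


def decode_bit(channel_bit: int, nonce: bytes, key: bytes, block_len: int) -> tuple[int, int]:
    """Receiver side: return (n, M[n]) recovered from the channel bit."""
    n, k = index_and_key_bit(nonce, key, block_len)
    return n, channel_bit ^ k


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0]
    key = b"shared secret"
    b = encode_bit(message, b"packet-header-1", key)
    print(decode_bit(b, b"packet-header-1", key, len(message)))
```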

There are several techniques that the sender can use to determine when a bit 
has been transmitted. 

The sender assumes that a block has been transmitted after it has achieved
the occupation number o for every bit in the message. In order for the receiver
to know when they have received a block, the last l_c bits of the message are a
checksum C of the first l − l_c bits.
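
The receiver's completion test could look like the following sketch. The text
does not specify the checksum function, so a truncated SHA-256 digest is used
here only as an assumption, and unreceived bits are represented by None for
the purpose of the example.

```python
import hashlib


def checksum_bits(payload_bits: list[int], checksum_len: int) -> list[int]:
    """Assumed checksum C: the low checksum_len bits of SHA-256 over the payload."""
    digest = hashlib.sha256(bytes(payload_bits)).digest()  # one byte per bit
    value = int.from_bytes(digest, "big")
    return [(value >> i) & 1 for i in range(checksum_len)]


def make_block(payload_bits: list[int], checksum_len: int) -> list[int]:
    """Sender side: append the checksum to the payload to form the block M."""
    return list(payload_bits) + checksum_bits(payload_bits, checksum_len)


def block_complete(block_bits: list, checksum_len: int) -> bool:
    """Receiver side: True once every bit is known and the checksum matches."""
    if any(bit is None for bit in block_bits):
        return False
    payload = block_bits[:-checksum_len]
    return block_bits[-checksum_len:] == checksum_bits(payload, checksum_len)


if __name__ == "__main__":
    block = make_block([1, 0, 1, 1, 0, 0, 1, 0], 16)
    print(block_complete(block, 16))
```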

3.4 Finding a Covert Channel 

In attempting to locate a covert channel we restrict our considerations to covert 
channels over the network. This is because most of the time the network is the 
only mechanism through which a pair of hosts can reasonably communicate. 

There are two ways that we could transmit information. We could send new 
packets and try to make them look innocuous, or we could modify existing 
packets. Obviously, it will be easier to maintain covertness if we modify exist- 
ing packets. If we were to send new packets, we would need to come up with 
a mechanism to generate innocuous looking data. If an adversary knew what 
this mechanism was, they could likely detect our fake innocuous data and our 
communication would no longer be covert. In contrast, if we modify packets, 
all packets that get sent are legitimate packets and an adversary will have a 
more difficult time detecting that anything is amiss. Thus, we choose to modify 
existing packets. 

We can modify existing packets in two ways. We can modify the application
data or we can modify the protocol headers. Modifying the application data
requires a detailed understanding of the type of data sent by a wide variety of
applications. Care must be taken to ensure that the modified data could have
been generated by a legitimate application, and we must guess what sort of
applications the adversary considers innocuous. It is easier and more general
to modify the protocol headers because there are fewer network protocols in
existence than application protocols. Most applications use one of a handful of
network protocols. Furthermore, the interpretation of protocol header fields is
well defined, so we can determine if a change to a field will disrupt the protocol.

The problem remains, however, that we must only produce modified protocol 
headers that would normally have been produced by the operating system. For 
example, we could attempt to modify the least significant bit of the window 
size field of TCP packets. However, most 32 bit operating systems tend to have 
window sizes that are a multiple of four. Since our modification would produce 
many window sizes that were not multiples of four, an adversary could detect 
that we were modifying the window size fields. Similarly, we could attempt to 
hide data in the identification field of IP packets. However, many operating 
systems normally generate sequential identification field values, so an adversary 
could detect the presence of covert data based upon this discrepancy. 
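
The window size example suggests how simple such a detection test can be. The
sketch below is an illustration from the adversary's point of view; the
function name and the sample window sizes are invented for the example.

```python
def fraction_not_multiple_of_four(window_sizes) -> float:
    """Share of observed TCP window sizes that are not a multiple of four.

    On hosts whose window sizes are normally multiples of four, a clearly
    non-zero share hints that the low bits are being modified.
    """
    sizes = list(window_sizes)
    if not sizes:
        return 0.0
    return sum(1 for size in sizes if size % 4 != 0) / len(sizes)


if __name__ == "__main__":
    unmodified = [5840, 5840, 8760, 16384]
    modified = [5841, 5840, 8761, 16384]  # least significant bit overwritten
    print(fraction_not_multiple_of_four(unmodified))
    print(fraction_not_multiple_of_four(modified))
```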

For these reasons, we wish to avoid directly modifying packet headers. Instead 
we observe that more subtle modifications to the operating system’s handling 
of packets can result in a legitimate (and, thus, presumably harder to detect) 
change in headers. In particular, if we delay the processing of a packet in a 
protocol with timestamps, we can cause the timestamp to change. 

Detecting these delays will likely be very difficult because operating system 
timing is very complex and depends on many factors that an adversary may not 
be able to measure - other processes running on the machine, when keys are 
pressed on the keyboard, etc. Thus, this technique for sending information is 
very difficult to detect. 

We now look at applying this technique to TCP to create a channel with the 
properties described above. 

3.5 TCP Timestamps as a Covert Channel 

By imposing slight delays on the processing of selected TCP packets, we can 
modify the low order bits of their timestamps. 

The low bit of the TCP timestamp, when modified in this way, provides a 
covert channel as described above. The low bit is effectively random on most 
connections. The rest of the packet, or some subset, can be our nonce. When 
examined individually, packets (and thus bits) are not delivered reliably. 

Because TCP timestamps are based purely on internal timings of the host, 
on a slow connection their low bits are randomly distributed. By rewriting the 
timestamp and varying the timing within the kernel, we can choose the value of 
the low bit. As long as we choose values with a statistically random distribution, 
they will be indistinguishable from the unaltered values. 

The rest of the TCP headers provides a nonce that is nearly free from
repetition. The sequence number sent with a TCP packet is chosen more or less
randomly from a 2^32 number space. Thus, it is unlikely to repeat except on
retransmission of a packet. Even if it does repeat, the acknowledgment number
and window size fields will likely have changed. Even if those fields are the same,
the high order bits of the timestamp will likely have changed. It is extremely
unlikely that all of the headers, including the high order bits of the timestamp,
will ever be the same on two packets.
While TCP is a reliable stream protocol, it provides a stream of bytes that
are reliably delivered, rather than guaranteeing reliable delivery of individual 
packets. For example, if two small packets go unacknowledged they may be 
coalesced into a single larger packet for the purpose of retransmission. As a 
result, bits associated with the packets can be dropped, when their packets are 
not resent. Also, because bytes are acknowledged rather than packets, it is often 
not clear whether a given packet got through, further complicating the question 
of whether a bit was delivered. 

3.6 TCP Specific Challenges 

Rewriting TCP timestamps presents some additional challenges beyond
a standard implementation of the protocol described earlier. Timestamps must
be monotonically increasing. Timestamps must reflect a reasonable progression
of time. And when timestamps are rewritten, the nonce in the rest of the
packet can change.

Timestamps must be monotonically increasing. Because timestamps are to 
reflect the actual passing of time, no legitimate system would produce earlier 
timestamps for later packets. Were this done, it could be observed by checking 
the invariant that a packet with a larger sequence number in a stream also 
has a timestamp greater than or equal to other packets in that stream. When 
rewriting timestamps, we must honor this invariant. As a result, if presented 
with the timestamp 13 and needing to send the bit 0, we must rewrite to 14 
rather than 12. Additionally, we must make sure that any following packet has
a timestamp of not less than 14, even if the correct timestamp might still be 12.
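
The monotonicity rule can be illustrated with a minimal Python sketch. The function name `rewrite_timestamp` and the `floor` argument (the last timestamp already sent on the stream) are hypothetical names introduced here for illustration only; the sketch deliberately ignores the nonce recomputation discussed later in this section.

```python
def rewrite_timestamp(true_ts: int, wanted_bit: int, floor: int = 0) -> int:
    """Return the smallest timestamp >= max(true_ts, floor) whose low bit is wanted_bit.

    `floor` is the last timestamp already sent (hypothetical bookkeeping),
    so the result never moves backwards.
    """
    ts = max(true_ts, floor)
    if (ts & 1) != wanted_bit:
        ts += 1  # only ever increment, never decrement
    return ts


# Example from the text: timestamp 13 and bit 0 to send give 14, not 12.
sent = rewrite_timestamp(13, 0)
# A later packet whose true timestamp is still 12 must not go below 14.
later = rewrite_timestamp(12, 0, floor=sent)
```

Taking the maximum of the true timestamp and the last sent value is one simple way to honor the invariant; a real sender would also have to delay the packet until the rewritten timestamp is valid.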

Timestamps must reflect a reasonable progression of time. Though times- 
tamps are implementation dependent and their low order bits random, the pro- 
gression of the higher order bits must reflect wall clock time in most implemen- 
tations. Because an adversary can be presumed to know the implementation of 
the unmodified TCP stack, they are aware of what the correct values of times- 
tamps are. In order to send out packets with modified timestamps, and keep 
timestamps monotonically increasing, streams must be slowed so that the times- 
tamps on packets are valid when they are sent. Thus, we can be thought of as 
not rewriting timestamps but as delaying packets. 

As an additional challenge, because we must only increase timestamps, we 
will sometimes cause the high order bits of the timestamp to change. To decrease 
the chance of nonce repetition, we include the higher-order bits of the timestamp 
in the nonce. When incrementing timestamps, these bits may change, and the 
nonce will change. When the nonce changes, we will have to recompute n and k, 
and thus may have to further increment the timestamp. However, at this point 
the low bit of the timestamp will be 0, and so incrementing will not change the 
nonce. This algorithm can be seen in Figure 1.

3.7 Choosing Parameters for TCP 

Fig. 1. Rewriting Timestamps

For a checksum of size n bits, a collision can be expected one time in 2^n.
Assuming a sustained packet rate of ten packets per second (an upper bound), we
will see a collision every 2^n/10 seconds. We selected our checksum size to be a
multiple of eight and a power of two to keep the checksum byte aligned and to
make it consistent with standard hash functions. A checksum size of 16 bits is
clearly too small, as it results in collisions every two hours. A 32 bit checksum
raises this time to 13.5 years, which we deem acceptable without making the
amount of data per block too small.
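
The arithmetic behind these figures can be checked with a short Python sketch. The helper name `seconds_between_collisions` is hypothetical, and the rate of ten packets per second is the upper bound assumed in the text.

```python
def seconds_between_collisions(checksum_bits: int, packets_per_second: float = 10.0) -> float:
    """Expected time between accidental checksum matches: 2^n trials at the given rate."""
    return 2 ** checksum_bits / packets_per_second


SECONDS_PER_HOUR = 3600.0
SECONDS_PER_YEAR = 365.25 * 24 * 3600.0

hours_16 = seconds_between_collisions(16) / SECONDS_PER_HOUR  # on the order of two hours
years_32 = seconds_between_collisions(32) / SECONDS_PER_YEAR  # on the order of 13 to 14 years
```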

4 Implementation 

4.1 Sending Messages 

Our sender is implemented on top of the Linux kernel. The current implemen- 
tation of a sender is a minor source modification to provide a hook to rewrite 
timestamps, and a kernel module to implement the rewrite process, to track the 
current transmission, and to provide access to the covert channel messaging to 
applications. The current system only provides one channel to one host at a 
time, but generalizing to multiple channels should not be difficult. 

We selected SHA1 as the hash. It is a standard hash function, believed to be
collision resistant and one-way. Source code is freely available, which made it
even more attractive. We needed to put our own interface on SHA1 and modify
the code so that it could be used in both the kernel code and in the receiving
application.

The basic algorithm is, for each packet, to compute the cipher text bit to be
included in that packet according to Figure 2. Then the timestamp is rewritten
according to the method described in Figure 1. This is a simple function
implementing the rewriting algorithm described in Section 3.6. This algorithm
can be seen in the pseudocode of Figure 3, particularly in the recursive call to
EncodePacket.

Fig. 2. Sender

ENCODEPACKET(Packet P, TimeStamp T)
    GetHeaders(P) → PacketHeaders
    SHA1(PacketHeaders)[0-7] → Index
    CurrentBlock[Index] → PlainTextBit
    SHA1(PacketHeaders)[8] → KeyBit
    PlainTextBit ⊕ KeyBit → CipherTextBit
    if T[0] ≠ CipherTextBit then
        T + 1 → T
        if T[0] = 0 then
            return EncodePacket(P, T)
        end if
    end if
    TransmitCount[Index]++
    if Minimum(TransmitCount) > MinimumTransmitCount then
        NextBlock → CurrentBlock
    end if
    SendPacket(P, T)

Fig. 3. Pseudocode for EncodePacket 

To encode a packet, the timestamp is incremented until it has the proper 
value to be sent. When a packet is ready to be sent, the occupation number for 
the bit in the packet is increased. Occupation numbers are tracked in the array 
TransmitCount. If the minimum occupation number of every bit in the block is 
ever higher than the required occupation number, the block is presumed received 
and the next block begins transmission. 
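
The encoding step can be sketched in Python as follows. This is only an illustration under stated assumptions, not the kernel implementation: the names `index_and_key_bit` and `encode_packet` are hypothetical, the nonce is taken to be the header bytes plus the high-order timestamp bits, the shared secret is simply appended before hashing, and "bits 0-7" and "bit 8" of the hash are read as the first digest byte and the low bit of the second.

```python
import hashlib


def index_and_key_bit(headers: bytes, ts_high: int, secret: bytes):
    """Hash the nonce (headers + high-order timestamp bits) with the secret.

    Assumed bit layout: first digest byte -> block index, low bit of the
    second byte -> key bit.
    """
    digest = hashlib.sha1(headers + ts_high.to_bytes(4, "big") + secret).digest()
    return digest[0], digest[1] & 1


def encode_packet(headers: bytes, ts: int, block, counts, secret: bytes) -> int:
    """Return a timestamp >= ts whose low bit carries the cipher text bit."""
    while True:
        index, key_bit = index_and_key_bit(headers, ts >> 1, secret)
        cipher_bit = block[index] ^ key_bit
        if (ts & 1) == cipher_bit:
            break
        # Increment only. If the low bit wraps to 0, the high-order bits
        # (and thus the nonce) change, so the loop recomputes index and key.
        ts += 1
    counts[index] += 1  # occupation number of the bit actually sent
    return ts
```

The loop plays the role of the recursive call in the pseudocode: at most two increments are needed, because after a wrap to low bit 0 one more increment leaves the nonce unchanged. A caller would compare `min(counts)` against the required occupation number to decide when to move to the next block.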

4.2 Receiving Messages 

The receiving process is designed to be portable and entirely located in
user-space. It is much simpler than the sender side, and the primary interesting
part is determining when we are done with a block and the boundaries between
different data blocks.

Fig. 4. Receiver

RECEIVEPACKET(Packet P, TimeStamp T)
    GetHeaders(P) → PacketHeaders
    SHA1(PacketHeaders)[0-7] → Index
    T[0] → CipherTextBit
    SHA1(PacketHeaders)[8] → KeyBit
    CipherTextBit ⊕ KeyBit → PlainTextBit
    PlainTextBit → CurrentBlock[Index]
    if ValidateChecksum(CurrentBlock) then
        OutputBlock
    end if

Fig. 5. Pseudocode for ReceivePacket

Packets are collected by the receiver using the libpcap interface to the
Berkeley Packet Filter. This library is part of the standard utility tcpdump and
has been ported to a wide variety of platforms. Our receiver is simple C, using
only libpcap and our SHA1 library. Unlike the sender, it is not tied to the Linux
platform, and will probably run anywhere that libpcap will run.

The receiver maintains a buffer initialized to all zeroes which represents the 
current data block to be decoded. As packets are received, the receiver computes 
the hash of the packet headers concatenated with the shared secret. He then
XORs bit 8 of the hash with the low order bit of the timestamp of the packet
and places the result in the buffer at the place indicated by the index.

In actuality, the data block contains fewer than BLOCKSIZE bits of data.
Appended to it is a checksum of the data. The purpose of the checksum is
to inform the receiver when he has received the entire valid block and should
output plaintext and allocate a new block buffer. The receiver calculates this
checksum every time he receives a bit and adds it to the buffer. This checksum
needs to be collision resistant, such that the probability that the receiver will
believe he has found a valid output prematurely, without actually having done
so (either by chance or by design of the adversary), is sufficiently low.
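
A minimal Python sketch of the receiver loop is given below. It is an illustration under assumptions rather than the C receiver itself: the block size of 256 bits follows from the 8-bit index, the 32-bit checksum is taken (hypothetically) as the first four bytes of SHA-1 over the data bits, and the nonce layout matches the assumption that headers, high-order timestamp bits, and secret are concatenated before hashing. All function names are hypothetical.

```python
import hashlib

BLOCK_BITS = 256                 # 8-bit index -> 256 bit positions
CHECKSUM_BITS = 32               # checksum size chosen in the text
DATA_BITS = BLOCK_BITS - CHECKSUM_BITS


def bits_to_bytes(bits):
    """Pack a list of 0/1 values (length a multiple of 8) into bytes, MSB first."""
    out = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for b in bits[i:i + 8]:
            byte = (byte << 1) | b
        out.append(byte)
    return bytes(out)


def checksum_bits(data_bits):
    """Assumed checksum: first CHECKSUM_BITS bits of SHA-1 over the data bits."""
    digest = hashlib.sha1(bits_to_bytes(data_bits)).digest()[:CHECKSUM_BITS // 8]
    return [(byte >> (7 - i)) & 1 for byte in digest for i in range(8)]


def make_block(data_bits):
    """Data followed by its checksum, as the sender would lay out a block."""
    return list(data_bits) + checksum_bits(data_bits)


def validate_checksum(block):
    return block[DATA_BITS:] == checksum_bits(block[:DATA_BITS])


def receive_packet(headers: bytes, ts: int, buffer, secret: bytes) -> bool:
    """Decode one bit into the buffer; return True once the block validates."""
    digest = hashlib.sha1(headers + (ts >> 1).to_bytes(4, "big") + secret).digest()
    index, key_bit = digest[0], digest[1] & 1
    buffer[index] = (ts & 1) ^ key_bit
    return validate_checksum(buffer)
```

The receiver starts from an all-zero buffer and calls `receive_packet` for every captured packet; once it returns True, the first `DATA_BITS` bits are output and a fresh buffer is allocated.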



5 Evaluation 

5.1 Security 

The security of this protocol is violated when an adversary can determine what 
data we are sending or that we are sending data at all. 

The system is designed to send out packets at intervals such that the
timestamps increase monotonically over time and such that they do not increase
more quickly than the 10 ms granularity of Linux timestamps. We use our ability
to delay packets to make certain that the mean time between packets is relatively
constant. In a normal link of this speed, it would be expected that the low order
bits of the timestamp would be unpredictable. If this is the case, we will have
achieved our goal of undetectability.

Two things contribute to the low order bit of the timestamp: the plain text 
bit and the key bit. Our goal is to XOR these two bits together and obtain an
unpredictable bit that we can use in our timestamp. Recall that the key bit is 
the 9th bit of the hash of the packet headers and the secret key. Given a random 
oracle model for the hash function used by the sender, the key bit will be a 
random number provided that packet headers do not collide. 

Packet headers collide only when all TCP header fields are the same, including
sequence number, window, flags, options, source port, destination port, and
the high-order 31 bits of the timestamp; the odds of such a collision happening
are remarkably small. As long as no such collisions occur, the XOR of the plain
text bit with the key bit is essentially a one-time pad. The low order 9 bits of the
hash will collide approximately once every 512 packets, but the adversary has
no way to detect these collisions without the key. He is left with a chance
that there is a collision on the 9th bit, in which case he can determine if the
two timestamp bits are the same or different. However, he has a chance
of guessing this anyway.

Should headers collide, one bit of information is revealed about the two bits
of plain text encoded in those two packets. Even so, no information is gained
about the sender’s secret key. Of course, if the headers collide often enough the
timestamps will display the entropy of the message, which is lower than expected,
and the process will be detectable. Note that if the headers collide often enough
it will take a great many packets and a large amount of time to get the message
through at all. This might tip off the sender that something odd is up. Of course,
too many collisions will make the low order bits of the timestamp have low
entropy and the system will be detectable.

This assumes that the hash function used is one-way. 

We believe the probability of collisions, using a good hash function with
the properties specified in our design section, is sufficiently low to avoid the
problem of a two-time pad. However, if the users were in a situation where the
adversary could somehow cause the packet headers to collide, or simply in a
system where the packet headers had little entropy, the security of the system
could be increased by encrypting blocks of data before sending them (using an
encryption mode that does not cause interdependency between blocks). In this
way, the entropy of the data being sent would be high. This bit would then
be sent as the low order bit of the timestamp and the system would otherwise
proceed as before. Note that this adds to the inefficiency of the system, because
the hash must still be computed in order to generate the 8 bit index.



5.2 Robustness 

The system is also broken if the adversary can deny service to the users of the 
channel. A powerful adversary could defeat this system by overwriting the TCP
timestamps of all packets in some unpredictable way. This is a feasible attack 
for some powerful adversaries. However, it is not currently being done in any 
situations of which we are aware. If this paper caused this system to be used 
so widely that corporations and governments who wished to censor those us- 
ing networks they controlled were forced to modify all packets, that would be 
a favorable outcome. The techniques applied in this paper could be used to de- 
sign another covert channel which would have to be closed, raising the cost of 
censorship. 

Apart from this attack our system is fairly robust. Since all bits are sent 
independently multiple times, modifications and packet sniping are unlikely to 
prevent the message from getting through unless performed in large numbers. 

5.3 Performance 

After sending 3000 packets, there is a 99.6% chance that we have sent every bit 
at least once. After sending 5000 packets the probability that we have not sent 
every bit has dropped to around 1 in a million. 
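
These probabilities can be approximated with a short Python sketch based on inclusion-exclusion (the classical coupon-collector calculation). It assumes a block of 256 bit positions, as implied by the 8-bit index, and that each packet selects a position uniformly at random; the function name is hypothetical. Under these assumptions it yields values close to those quoted above.

```python
from math import comb


def prob_all_bits_sent(packets: int, block_bits: int = 256) -> float:
    """Probability that every bit position was selected at least once.

    Inclusion-exclusion over the set of positions that were never hit;
    numerically well behaved when `packets` is large relative to `block_bits`.
    """
    return sum(
        (-1) ** j * comb(block_bits, j) * ((block_bits - j) / block_bits) ** packets
        for j in range(block_bits + 1)
    )


p_3000 = prob_all_bits_sent(3000)
p_5000 = prob_all_bits_sent(5000)
```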

3000 packets may seem like a lot but a single hit on an elaborate website 
can generate 100 packets or more, especially if the site has many images which 
must be fetched with individual HTTP GET requests. Furthermore, transfer 
of a 3 megabyte file will likely generate that many packets. Thus, it is fairly 
easy to generate enough packets to assure a fairly high probability of successful 
transmission of a data block. 

To send a total of n bits, the message will take approximately n * 3.75ms if 
the sender is not limited by network constraints. 3.75ms is the expected time to 
wait to send a packet. 

6 Conclusion and Future Directions

6.1 Conclusions 

We have designed a protocol which is applicable to a variety of low bandwidth, 
lossy covert channels. The protocol provides for the probabilistic transmission 
of data blocks. 

Identifying potential covert channels is easier than working through the de- 
tails of sending data covertly and practically through them. The protocol gives 
a method for sending data over newly identified covert channels with minimal
design investment. The implementation of this protocol with TCP timestamps 
is not yet complete, but we are confident that there are no major obstacles 
remaining. 



6.2 Future Directions

Future directions of our research involve improvements to our implementation 
and work on channel design that deals with more powerful adversaries and more 
diverse situations. 

It would be useful if the sender in the implementation were able to track,
possibly via ACK messages, which data had actually been received by the receiver.
If this were the case, the sender would not have to rely on probability to decide 
when a message had gotten through and when he should begin sending more 
data. 

It would also be useful to develop a bidirectional protocol that provided 
reliable data transfer. Although it would theoretically be possible to implement 
something like TCP on top of our covert channel, this would likely be inefficient. 
Thus, it would be useful to develop a reliability protocol specifically for this type 
of channel. 

We would also like to identify channels which a resource rich active adversary 
would not be able to close. It would also be useful to deal with key exchange, as 
our sender and receiver may not have the opportunity to obtain a shared secret. 

Our system is currently only practical for short messages; it would be de- 
sirable to be able to send more data. Lastly, our protocol is designed to work 
between two parties. It would be interesting to design a broadcast channel such 
that messages could be published covertly. 

Abstract. A private information retrieval (PIR) protocol allows a user
to retrieve one of N records from a database while hiding the identity of 
the record from the database server. 

With the initially proposed PIR protocols, the server has to process the
entire database to answer a query, resulting in an unacceptable response
time for large databases. Later solutions make use of some preprocessing
and offline communication, such that only O(1) online computation and
communication are performed to execute a query. The major drawback
of these solutions is offline communication, comparable to the size of the
entire database.

Using a secure coprocessor we construct a PIR scheme that eliminates
both drawbacks. Our protocol requires O(1) online computation and
communication, periodic preprocessing, and zero offline communication.
The protocol is almost optimal. The only parameter left to improve
is the server’s preprocessing complexity - the least important one.

Keywords: Efficient realization of privacy services. 



1 Introduction 

A private information retrieval (PIR) protocol allows a user to retrieve one of N 
records from a database while hiding the identity of the record from a database 
server. That is, with a PIR protocol, the user can perform the query “return 
me the i-th record” in such a way that no one, not even the server, receives any 
information about i. 

Many practical e-commerce applications could benefit from using PIR to
address user privacy. An obvious, common application is trading digital
goods. Using PIR, a user may retrieve a selected item (a digital article, an
e-book, a music file, etc.) privately. “Privately” means that a digital good is
retrieved such that no one except the client observes the identity of the good.
Billing can be performed as well, because the retrieval is controlled by the server.

Naturally, the quality of every PIR protocol is measured by the two following 
parameters: the complexity of the computation to perform one query, and the 
complexity of the communication done between the client and the server to 
execute one query. 

* This research was supported by the German Research Society, Berlin-Brandenburg 
Graduate School in Distributed Information Systems (DFG grant no. GRK 316.) 

R. Dingledine and P. Syverson (Eds.): PET 2002, LNCS 2482, pp. 209-|22^ 2003. 

1.1 Motivation

Initially developed PIR protocols scale very poorly, making it impossible
to use them in the real world. In order to process a PIR query to retrieve a
single record, the server must perform complex computations with each record
of the entire database.

Several attempts were made to address this problem. Two papers present the
state of the art [BDF00, SJ00]. The PIR protocols by Bao et al. and Schnorr et al.,
developed independently, employ very similar ideas to address the problem,
although they introduce a new one. Namely, offline communication, comparable
to the size of the entire database, must be performed between the client and
the server before these protocols start. (Sect. 2.3 discusses these protocols in
detail.)

1.2 Our Results 

We present a protocol that addresses the problem of constructing a PIR
protocol with O(1) complexity of query response time and communication, and no
offline communication. For our protocol, O(1) records have to be processed
online in order to answer a query. But, in contrast to [BDF00, SJ00], the protocol
eliminates the offline communication completely. Our protocol is almost
optimal in the sense that the only parameter left to be optimized is the server’s
preprocessing complexity - the least critical one.



1.3 Preliminaries and Assumptions

In the following, N denotes the number of records in the database. The only
type of query considered is “return me the i-th record”, 1 ≤ i ≤ N.

As in [SS00, SS01, BDF00, SJ00], we omit the precise mathematical definition
of privacy while presenting the protocol. The definition “no information about
queries is revealed” is enough. However, we introduce a formal definition of
privacy later in this paper in order to formally prove that the protocol fulfills
the privacy property.

We say that a PIR protocol has O(A) communication complexity and O(B)
computation complexity if only O(A) records must be communicated between
the server and client, and only O(B) records must be processed by the server (in
order to answer one query). For example, we say that computation complexity is
O(1) if the number of records that has to be processed by the server to answer
a query is independent of N.

By the client we denote the computer the user forms his queries with. We
assume that the user trusts the client; i.e., the sensitive information stored or
processed at the client is assumed to be hidden from everyone except the user.



1.4 Structure of the Paper 

Having analyzed the related work in the next section, we present our basic
protocol in Sect. 3. Further details on the protocol and discussion are presented
in Sect. 4. We finish the paper with the discussion and future work ideas.
Furthermore, the appendix formalizes the protocol. Due to space limitations,
a formal definition of privacy based on information theory and the proof that
the protocol is private are not included in the final version of the paper. Both
the privacy definition and the protocol proof can be found in [AF01].

2 Related Work 

The PIR problem was first formulated by Chor et al. [CGKS95]. From the very
beginning two fundamental limitations became clear:

1. PIR is impossible, unless we consider sending the entire database to the client
as a solution. That is, the communication complexity of any PIR protocol
to perform one query is proven to be Ω(N).¹

2. In order for any PIR protocol to answer one query, the entire database 
must be read. This conclusion is based on the following simple observation: 
Independently from how a PIR protocol works, if the server does not read 
some of the database records while answering a query, then the (malicious) 
server may observe the records that the client did not request. This is a 
privacy violation by definition. 

While the first limitation affects the first parameter of a PIR protocol - the 
communication complexity, the second limitation affects the second parameter 
of a PIR protocol - the server computation complexity (or, query response time). 

The following three sections show the efforts made to overcome these
limitations. After each description we summarize the pros and cons of each protocol.
Finally, before the description of our protocol, we give a comparison of the state
of the art with our protocol in Sect. 2.4.



2.1 Computational PIR 

Although PIR with communication complexity less than Ω(N) is impossible
theoretically, it is found to be possible if computational cryptography is used.

The underlying idea is to rely on some intractability assumptions (the
hardness of deciding quadratic residuosity, in the case of [KO97]). Then, a protocol
works as follows. The client encrypts a query “return me the i-th record” in such a
way that the server can still process it using special algorithms and the entire
database as an input. However, under an intractability assumption, the server
recognizes neither the clear-text query nor the result. The result can be decrypted
by the client only.

¹ There is also a modification of the problem setting called “multi-server PIR”, where
several servers hold copies of the database. A communication complexity better
than Ω(N) may be achieved under the assumption that the servers do not collude
against the user [CGKS95, CG97, Amb97, BI01]. The idea is to send different queries
to different servers, so that i is not derivable from any single one of them. But having
gathered all the answers, the client can derive the i-th record. In this paper we do
not consider schemes based on several servers that do not communicate with each other.

Fig. 1. An example of a PIR protocol with SC.



Pros and Cons. Computational PIR protocols break through the first limitation;
[KO97] provides polynomial communication complexity, improved to
polylogarithmic communication complexity in [CMS99, KY01].

Still, the second limitation holds for such protocols: the server has to process
each record of the entire database to answer one query. Although these protocols
are elegant pieces of research from the viewpoint of mathematics, O(N)
computation complexity makes them practically infeasible even for small databases
[BDF00].

2.2 Hardware-Based PIR 

Smith et al. [SS00, SS01] make use of a tamper-proof device to implement the
following PIR protocol.

The idea is to use a secure coprocessor (a tamper-proof device) as a black
box, where the selection of the requested record takes place. Although hosted
at the server side, the secure coprocessor (SC) is designed so that it prevents
anybody from accessing its memory from outside [SPW98].

The basic protocol runs as shown in Fig. 1. The client encrypts the query
“return me the i-th record” with a public key of the SC, and sends it to the
server. The SC receives the encrypted query, decrypts it, reads through the
entire database, but leaves in memory the requested record only. The protocol
is finished after the SC encrypts the record and sends it to the client.

To provide integrity, the SC keeps all records of the database encrypted. We
discuss this in detail in Sect. 4.1.

Fig. 2. An example of a PIR protocol with preprocessing and offline communication.
Steps 1 and 2 are made offline once, and the other steps are performed online for every 
query submission. 



Pros and Cons. This PIR protocol improves the computation complexity. In
contrast to computational PIR protocols, only ordinary decryption has to be
performed for each of the N records to process a query.

The main disadvantage of this PIR protocol is the same as that of the
computational PIR protocols: the second limitation, i.e., O(N) computation complexity.



2.3 PIR with Preprocessing and Offline Communication 

Although it does not seem feasible to break through the second limitation - O(N)
computation - one could try to preprocess as much work as possible, such that,
when a query is submitted, it would require only O(1) computation to answer
it online.²

With this idea in mind, [BDF00, SJ00] independently present very similar PIR
protocols. Both utilize a homomorphic encryption, which is used by the server
to encrypt offline every record of the database. All these encrypted records are
sent to the client. This communication has to be done only once between the
client and the server before the PIR protocol starts, independently of how
many PIR queries will be processed online.

If the user wants to buy a record, he selects the appropriate (locally stored)
encrypted record and re-encrypts it. Then, the client sends it to the server and
asks to remove the server’s encryption. The server is able to do it because of the
homomorphic property of the encryption. The server removes its encryption, but
cannot identify the record because of the user’s encryption. The server sends it
back to the client. The user removes his encryption; the protocol is done (Fig. 2).

² As already explained above, we do not consider here approaches oriented toward a
setting with several servers that do not communicate with each other.

Table 1. Comparative analysis of the proposed protocol.

Parameter              PIR Protocol
                       Computational    With SC        With Preprocessing  The Proposed
                       [CMS99, KY01]    [SS00, SS01]   [BDF00, SJ00]       (With SC)
Online communication   asympt. optimal  optimal        optimal             optimal
Computation            O(N)             O(N)           O(1)                O(1)
Offline comm.          no               no             O(N)                no
Preprocessing          no               yes            yes                 yes

Pros and Cons. The protocols with preprocessing and offline communication
overcome the second limitation: only O(1) computation is required online to
answer one query, i.e., these protocols ensure a practical query response time.

However, the protocols suffer from another drawback: offline communication
comparable to the size of the entire database, which makes their practical
applicability questionable. (Imagine a user decides to buy a single digital book
or a music file at some digital store. He will probably react negatively after being
asked to download the entire encrypted content of the digital store in order to
proceed. Another problem is keeping the user’s database copy updated.)

2.4 State of the Art and Our Results 

In Table 1 we summarize the existing PIR protocols and compare them with the
proposed PIR protocol.

In summary, protocols with preprocessing [BDF00, SJ00] are the most effective
in terms of online computation and online communication complexity. Our
PIR protocol retains these parameters, but, in contrast to [BDF00, SJ00], it does
not require offline communication.

3 The Basic Protocol 

We start with the same basic model as described in Sect. 2.2. But in addition,
before starting the PIR protocol, the SC shuffles the records offline. That is, the
SC computes a random permutation of the records, and stores this permutation
in an encrypted form. Now, the server has no evidence of which record is which.

After the SC receives the first query “return the i-th record”, the SC does not
need to read the entire database. Instead, the SC accesses the desired encrypted
record directly. Then the encrypted record is decrypted inside the SC, encrypted
with the user’s key, and sent to the user (Fig. 3).

To answer a second query, the SC reads the previously accessed record first,
then the desired record. If the previously accessed record is not read by the SC,
then the privacy of the second query could obviously be broken.³ In case the
second query requests the same record as the first query, the SC reads a random
(previously unread) record. Otherwise some information about the queries is
revealed, namely, it would be observable that the queries are identical.

Fig. 3. I/O flows in the proposed PIR protocol.

So, to answer the k-th query, the SC has to read the k - 1 previously read
records first. Then the SC reads one of the unread records. Evidently, the SC
has to keep track of the accessed records.

It is up to the server to decide at which m = max(k) (1 ≤ m ≤ N) to stop
and to switch to another shuffled copy of the database, so that k would be equal
to one again. Since m is a constant independent of N, we can say that the server
has to process O(1) records online to answer each query.
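
The access pattern that the server observes can be sketched with a toy Python model. The class `ShuffledStore` and its attributes are hypothetical names used only for illustration; encryption, the SC's internal memory, and the actual reshuffling procedure are not modeled, and a fresh random permutation stands in for "another shuffled copy".

```python
import random


class ShuffledStore:
    """Toy model of which slots the SC reads; not the authors' implementation."""

    def __init__(self, n_records: int, m: int, seed: int = 0):
        assert 1 <= m <= n_records
        self.rng = random.Random(seed)
        self.n = n_records
        self.m = m                    # queries served per shuffled copy
        self._reshuffle()

    def _reshuffle(self):
        # position[i] = slot of record i in the current shuffled copy
        self.position = list(range(self.n))
        self.rng.shuffle(self.position)
        self.touched = []             # slots read so far on this copy

    def query(self, i: int):
        """Return the slots the server sees the SC read while answering query i."""
        if len(self.touched) == self.m:
            self._reshuffle()         # switch to a fresh copy, k becomes 1 again
        reads = list(self.touched)    # re-read the k - 1 previously read slots
        slot = self.position[i]
        if slot in self.touched:
            # Requested record was already fetched: read a random unread slot
            # so that repeated queries are not observable.
            unread = [s for s in range(self.n) if s not in self.touched]
            slot = self.rng.choice(unread)
        reads.append(slot)
        self.touched.append(slot)
        return reads


store = ShuffledStore(n_records=100, m=5)
reads_per_query = [len(store.query(7)) for _ in range(7)]
```

Even though the same record is requested every time, each query in this model reads exactly one previously unread slot, and the number of reads grows from 1 to m before resetting.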

Now that the basic idea has been introduced, we go into the details of our protocol
in the next section. The formalization of the protocol is presented in the appendix.

4 The Details 

A hypothetical attack is considered in Sect. 4.1. We demonstrate a trade-off
between offline and online computation in our protocol and discuss how to choose
the optimal trade-off in Sect. 4.2 and 4.3, respectively. Finally, we briefly consider
the cases with multiple queries and multiple secure coprocessors.

³ Assume that the server issued the first query itself. Then the server observed which
record was read by the SC, so the identity of that one encrypted record is known to
the server. Now, the SC reads another encrypted record to answer a user’s query.
The server can observe that the client is interested in a record different from the
one that the server requested before. This is a privacy violation.



4.1 Active Attacks 

In the work of Smith et al. [SS00, SS01], an attack is considered in which the
malicious server destroys or modifies an arbitrary record before the PIR protocol
starts. If the user complains after the PIR query is performed, the malicious
server concludes that the user was interested in the modified record, thus breaking
the privacy of the user. The solution proposed is to check the integrity of every
record in the database (while reading through the entire database) for every query.
If a record with broken integrity appears, the SC aborts the PIR protocol, regardless
of whether the forged record is requested by the client or not. In order to
provide the integrity check, each record is stored in an encrypted form.

The malicious server might try the same attack within our PIR protocol.
In this case, the SC does not have to check the integrity of each record in
the database to process one query. It is enough to check the integrity of the
requested encrypted record only.



4.2 Trade-Off between Preprocessing and Online Computation 

In our protocol, it is possible to balance the workload between the online and
preprocessing phases. Decreasing the amount of online work increases the offline
work and vice versa. Let m (1 ≤ m ≤ N) be a threshold number of records
allowed to be read online to answer a single query, as explained in Sect. 3.
Obviously, m is a trade-off parameter. Reducing m will decrease the online
computation (and, consequently, the query response time of the server), but will
increase the preprocessing workload.

Let $r_{online}$ be the average number of encrypted records that the SC reads
online to answer a query. This parameter characterizes the average response time of
the server. Let $w_{offline}$ be the average number of encrypted records that the SC
writes in order to be prepared to answer one query. This parameter characterizes
the average amount of additional storage used by the SC for answering one
query. The equations below express both parameters in terms of the trade-off
parameter.

\[ r_{online} = \frac{1 + 2 + 3 + \dots + m}{m} = \frac{m \cdot (m + 1)}{2 \cdot m} = \frac{m + 1}{2} \tag{1} \]

\[ w_{offline} = \frac{N}{m} \tag{2} \]

The dependencies between the trade-off parameter m, the online workload $r_{online}$,
and the preprocessing workload $w_{offline}$ are shown in Fig. 4 (for N = 10000).
From Equations (1) and (2) we derive the dependence between the online ($r_{online}$)
and offline ($w_{offline}$) parameters of the protocol.



\[ r_{online} = \frac{N}{2 \cdot w_{offline}} + \frac{1}{2} \quad \Longrightarrow \quad r_{online} \sim \frac{1}{w_{offline}} \tag{3} \]
Fig. 4. The dependence between online performance (max and average number of
records to read online per query) and preprocessing workload (number of offline write
operations per query).



The last equation shows that reducing the response time by an order of magnitude
increases the preprocessing work by an order of magnitude.
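The trade-off can be made concrete with a few lines of Python. The following is a minimal sketch, assuming the reading of this section in which the average online cost is (m + 1) / 2 and one shuffled copy of N records serves m queries; the function names online_reads and offline_writes are illustrative and do not appear in the protocol itself.

```python
def online_reads(m: int) -> float:
    """Average number of records read online per query.

    The k-th query on a shuffled copy reads k records (k = 1..m),
    so the average is (1 + 2 + ... + m) / m = (m + 1) / 2.
    """
    return (m + 1) / 2


def offline_writes(n: int, m: int) -> float:
    """Average number of records written offline per query.

    Assumption: one shuffled copy costs n writes and serves m queries.
    """
    return n / m


if __name__ == "__main__":
    N = 10000
    for m in (10, 100, 1000):
        # Each tenfold increase of m raises the online cost roughly tenfold
        # and lowers the offline cost tenfold.
        print(m, online_reads(m), offline_writes(N, m))
```

Printing the two quantities side by side for growing m illustrates the inverse relation between online and offline work stated in the text.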



4.3 Choosing the Optimal Trade-Off 

If the maximal allowed query response time is fixed, choosing the trade-off
parameter m is a straightforward task.

Another strategy for choosing the trade-off parameter might be minimizing
the overall work S(m), defined as the sum of the normalized online and
preprocessing workload parameters.

We show in Fig. 5 that the overall work S(m) does not remain constant while
varying the trade-off parameter. To determine the optimal trade-off parameter we
must find the minimum of the following function:

\[ S(m) = r_{online} + \varphi_{norm} \cdot w_{offline} \tag{4} \]

where $\varphi_{norm}$ is the normalization coefficient used to normalize the two
parameters.

We determine the optimal trade-off by finding the roots of the derivative of
S(m):



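\[ S'(m) = \left( \frac{m + 1}{2} + \varphi_{norm} \cdot \frac{N}{m} \right)' = \frac{1}{2} - \frac{\varphi_{norm} \cdot N}{m^2} \tag{5} \]

\[ \frac{1}{2} - \frac{\varphi_{norm} \cdot N}{m_{opt}^2} = 0 \quad \Longrightarrow \quad m_{opt} = \sqrt{2 \cdot \varphi_{norm} \cdot N} \]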
Fig. 5. The overall work done per query (calculated as a sum of normalized online and
offline parameters) is not constant for different values of the trade-off parameter.



For example, for $\varphi_{norm} = 1$ (reading one record online is considered equal
to writing and storing one record offline) and N = 10000 the optimal trade-off
parameter is $m_{opt} = \lfloor \sqrt{2 \cdot N} \rfloor = 141$.
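The numerical example can be cross-checked by exhaustive search. The sketch below assumes the overall work per query is (m + 1) / 2 plus the normalization coefficient times N / m; the names overall_work and optimal_threshold are hypothetical helpers introduced only for this illustration.

```python
import math


def overall_work(m: int, n: int, phi_norm: float = 1.0) -> float:
    """Overall work per query: online reads plus normalized offline writes."""
    return (m + 1) / 2 + phi_norm * n / m


def optimal_threshold(n: int, phi_norm: float = 1.0) -> int:
    """Integer threshold m in 1..n that minimizes the overall work (brute force)."""
    return min(range(1, n + 1), key=lambda m: overall_work(m, n, phi_norm))


if __name__ == "__main__":
    N = 10000
    # Compare the brute-force optimum with the integer part of sqrt(2 * N).
    print(optimal_threshold(N), math.isqrt(2 * N))
```

The brute-force search avoids relying on the closed form, so it also works when the normalization coefficient is changed.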

4.4 Multiple Queries and Multiple Coprocessors 

Multi-query optimization may be advantageous for our protocol. When several 
queries arrive, the SC may read previously accessed records only once, thus 
eliminating the need to perform this operation for every query. 

Shifting from a single SC to multiple SCs is not a trivial task for the earlier
SC-based PIR scheme. For our scheme, distributing the work between several SCs is
straightforward. For example, due to the small online workload, one SC might be
dedicated to answering queries, and the remaining secure coprocessors can do the
preprocessing work, i.e. preparing several shuffled copies of the database. Such a
simple parallelization is possible since preprocessing can be done independently of
query processing.

5 Discussion 

In this section we consider some questions that arise while reading the paper. 

5.1 Trust in SC 

One might assume that users must trust the SC. In the following, we clarify that
the SC need not be trusted by users. Instead, the trust in the SC is reduced to the
trust in statement 3 below, given that statements 1 and 2 can be checked
independently of the user's beliefs^.

1. The SC is a product of company A. 

2. The SC is unbreakable if it is designed in accordance with the product design
documentation of company A.

3. The SC is designed in accordance with the product design documentation.

In summary, the only possible concern about the SC is that it might not
(theoretically speaking) be designed in accordance with its specification, thus
intentionally leaving some hole on the physical or API level. In addition, in order
for the attack to succeed, the company that produces the SC must cooperate with the
company that trades digital goods.

Furthermore, this concern may be (hypothetically) addressed by company A itself,
by backing the trust in statement 3 with monetary incentives or a prize
for anybody who proves that an SC misbehaves or can be broken.

5.2 Private vs. Anonymous Retrieval of Information 

There is a fundamental difference between PIR approaches and anonymity
techniques, such as those based on Chaum's MIX. PIR hides the content of the
queries and does not depend on third parties. Anonymity techniques do not
hide the content of the queries from the server and do depend on third parties®.
The consequences are that the administration complexity is higher and the
privacy is not perfect because it does not withstand an “all-against-one” attack.

5.3 Preprocessing Complexity 

The preprocessing complexity of our protocol is high (i.e. quadratic in N), even
though preprocessing can be done offline as many times as secondary storage allows.
We are currently working on decreasing the preprocessing complexity.

6 Conclusion and Future Work 

Private Information Retrieval (PIR) can solve privacy issues in many practical 
e-commerce applications by enabling the user to retrieve a record of his choice 
from the database in such a way that no one, not even the database server, observes
the identity of the record. 

The existing PIR protocols either incur intolerable query response time (lin- 
ear in the size of the database) or introduce offline communication of the size of 
the entire database between the user and the server. Thus the applicability of 
both types of protocols is questionable from a practical point of view. 

^ Statements 1 and 2 need not really be trusted by users, because (i) the first statement
can be checked via PKI, and (ii) the second statement follows from the conclusions of several
tests of independent research labs. Both issues are sketched in an overview IRLP+oll . 
® That is, anonymous approaches require “other people (other entities) doing other
things”, in contrast to PIR, which does not require anything but one server.
We presented a new PIR protocol with preprocessing that has O(1) query
response time, optimal online communication complexity, and does not require
offline communication. This property is due to a novel preprocessing algorithm
based on shuffling and due to the usage of a secure coprocessor. We showed the
trade-off between the online and preprocessing workloads for the protocol. The
protocol is scalable for multiple queries and multiple secure coprocessors.

An open question is whether the preprocessing complexity of our protocol is 
optimal or not. Minor additions should be made to the algorithms to withstand 
timing attacks. 
A The Formal Protocol

This part of the paper formally presents the concepts and algorithms discussed
in this paper. Namely, we start with the proposed PIR protocol, which refers in
turn to the database shuffling algorithm (formalized in Appendix A.2) and to
the protocol for answering the k-th query (formalized in Appendix A.3).

A.1 Almost Optimal PIR Protocol

We give a simple version of the protocol, with one shuffled database. The re- 
shuffling takes place when a fixed number of queries have been answered. The 
protocol can easily be generalized by preparing several shuffled copies of the 
database, and by producing shuffled copies in parallel with the process of an- 
swering queries. 

The preprocessing phase of the protocol is handled by the SC and consists of 
periodically producing a number of shuffled databases (with appropriate indexes) 
using Algorithm 1. The online phase of the protocol works as follows. 
(1) The SC initializes a query counter k = 1, loads the index V' of a shuffled
database into the internal memory, and initializes the track of accessed
records T = ∅.

(2) The client (i.e., the user) comes up with a query Q = “return me the i-th
record”.

(3) The client and the SC generate and exchange symmetric keys Key_C and
Key_SC using a public key infrastructure.

(4) The client sends the encrypted query E(Q, Key_SC) to the server.

(5) The SC receives and decrypts the query.

(6) The SC runs Algorithm 2 to get the answer A = R_i.

(7) The SC sends the encrypted answer E(A, Key_C) to the client.

(8) The client decrypts the answer to the query.

(9) The SC increments k by one.

If k > m, the SC switches to a new shuffled database, reloads the corresponding
index, and re-initializes the query counter k = 1 and the track of accessed
records T = ∅.

(10) The protocol may be repeated from Step (2).

A.2 Database Shuffling Algorithm

To provide preprocessing for the PIR protocol, the database shuffling algorithm
is executed inside the SC. The only operations observable from outside the SC
are read and write operations, which are used to manage the external storage.
The complexity of this algorithm is O(N^2). The function shuffle that provides
a shuffled index is constructed in accordance with [Knu81], Sect. 3.4.2. A basic
realization of the database shuffling algorithm is presented as Algorithm 1.



Input: DB: a database of N records
Output: DB_shuffled: a shuffled copy of DB, each record is encrypted;
        INDEX_shuffled: an encrypted index of DB_shuffled

V = [1, ..., N]   {Index of the database DB}
V' = shuffle(V)   {Prepare index for the shuffled database DB_shuffled}
for g = 1 to N do
    for h = 1 to N do
        read(Temp <= DB[h])   {Read the h-th record into the SC}
        if h = V'[g] then
            Record = Temp   {Save the V'[g]-th record of the database internally}
        end if
    end for
    write(DB_shuffled[g] <= encrypt(Record))   {Produce g-th record of DB_shuffled}
end for
V_encrypted = encrypt(V')   {Encrypt the index with some key of the SC}
write(INDEX_shuffled <= V_encrypted)   {Store the encrypted index of DB_shuffled}

Algorithm 1: The basic database shuffling algorithm
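For readers who prefer executable code, Algorithm 1 can be approximated in Python as follows. This is only a sketch under stated assumptions: indices are 0-based, the encrypt argument is a caller-supplied stand-in (no real cipher is used here), the index is returned in the clear rather than encrypted, and the returned read counter exists only to make the quadratic cost visible.

```python
import random


def shuffle_database(db, encrypt, rng=None):
    """Sketch of the basic shuffling algorithm (0-based indices).

    Returns (shuffled, index, reads), where index[g] is the position in db
    of the record stored at position g of the shuffled copy.
    """
    rng = rng or random.Random()
    n = len(db)
    index = list(range(n))      # V
    rng.shuffle(index)          # V'
    shuffled = []
    reads = 0
    for g in range(n):
        record = None
        # Scan the whole database for every output record, so that the
        # observable read pattern does not depend on the permutation.
        for h in range(n):
            temp = db[h]
            reads += 1
            if h == index[g]:
                record = temp
        shuffled.append(encrypt(record))
    return shuffled, index, reads


if __name__ == "__main__":
    toy_encrypt = lambda r: ("enc", r)   # placeholder, not encryption
    out, idx, reads = shuffle_database(list("abcde"), toy_encrypt)
    print(idx, reads)
```

The nested loops perform N reads for each of the N output records, which is where the O(N^2) preprocessing cost comes from.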



Algorithm 2 uses i, V' , and T to privately retrieve the requested record into the SC; 
it also updates T appropriately. 




A.3 An Algorithm for Processing the k-th Query

This algorithm (Algorithm 2) is executed inside the secure coprocessor, and is 
used as a part of the online phase of the PIR protocol. The only operations ob- 
servable from outside the SC are read operations to access the shuffled database. 
As discussed above, the complexity of this algorithm is O(1).



Input: DB_shuffled, V': a shuffled copy of DB (each record is encrypted) and its index;
       k: the sequence number of the query being processed using DB_shuffled;
       i: the number of the DB record requested
Output: Answer: record R_i of DB privately retrieved into the SC

g = 1; GotAnswer = No   {An indicator of the presence of Answer inside the SC}
while g < k do
    read(Temp <= DB_shuffled[T[g]])   {Read previously accessed records one by one}
    if V'[T[g]] = i then
        Answer = Temp   {One of the accessed records is the answer, save it}
        GotAnswer = Yes
    end if
    g = g + 1
end while
if GotAnswer = No then
    obtain i' : V'[i'] = i   {Get the position of the i-th DB record in DB_shuffled}
    read(Answer <= DB_shuffled[i'])   {Access the required record directly}
    T[k] = i'   {The track list is updated with the k-th item}
else
    UnRead = {1, ..., N} \ {T[1], ..., T[k - 1]}
    h = select-random-from(UnRead)   {Select randomly one of the unread records}
    read(Temp <= DB_shuffled[h])   {Read the selected record into the SC}
    T[k] = h   {The track list is updated with the k-th item}
end if
return Answer

Algorithm 2: An algorithm for processing k-th query
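Algorithm 2 can likewise be sketched in Python. The sketch makes several simplifying assumptions: indices are 0-based, records are stored unencrypted, the track T is a plain list of positions already read, and the extra return value touched (the positions read for this query) is added only so that the observable access pattern can be inspected.

```python
import random


def answer_query(shuffled, index, track, i, rng=None):
    """Sketch of processing one query for original record i.

    index[pos] is the original number of the record stored at shuffled[pos];
    track holds the positions read by earlier queries and is updated in place.
    Returns (answer, touched).
    """
    rng = rng or random.Random()
    answer = None
    got_answer = False
    touched = []
    for pos in track:                      # re-read all previously accessed records
        temp = shuffled[pos]
        touched.append(pos)
        if index[pos] == i:
            answer = temp
            got_answer = True
    if not got_answer:
        pos = index.index(i)               # position of record i in the shuffled copy
        answer = shuffled[pos]
    else:
        unread = [p for p in range(len(shuffled)) if p not in track]
        pos = rng.choice(unread)           # dummy read of a fresh position
        _ = shuffled[pos]
    touched.append(pos)
    track.append(pos)
    return answer, touched


if __name__ == "__main__":
    rng = random.Random()
    db = [f"rec{j}" for j in range(8)]
    index = list(range(8))
    rng.shuffle(index)
    shuffled = [db[j] for j in index]
    track = []
    for wanted in (3, 3, 1):
        print(answer_query(shuffled, index, track, wanted, rng))
```

In both branches the k-th query reads the k - 1 tracked positions plus exactly one new position, so an observer sees the same pattern whether or not the request repeats an earlier one.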
Abstract. The technique Private Information Retrieval (PIR) perfectly protects
a user’s access pattern to a database. An attacker cannot observe (or determine) 
which data element is requested by a user and so cannot deduce the interest of 
the user. We discuss the application of PIR on the World Wide Web and com- 
pare it to the MIX approach. We demonstrate particularly that in this context 
the method does not provide perfect security, and we give a mathematical 
model for the amount of information an attacker could obtain. We provide an 
extension of the method under which perfect security can still be achieved. 



1 Introduction 

The importance of the Internet continues to grow. Today, any user can find informa- 
tion about nearly any topic of interest. With the increasing use of the Internet, interest 
in collecting information about user behavior is also increasing. For example, an 
Internet bookstore would like to deduce the genre that a user reads, so that it can 
suggest other books of this genre to promote sales. This is of course a minor exam- 
ple; but if one were given all the collected information about a user in the WWW, one 
could characterize the user and his behavior in great detail. This is the real problem of 
privacy, as users may object to being profiled so specifically. Therefore, the user may 
want to protect his interest-data. In general the protection of interest-data can be done 
in two ways: 

1. Indirect: A number of users (n > 1) form a so-called anonymity group and
request the interest-data using an anonymity protocol such as a MIX or DC
network [7, 8, 24]. Although the server is able to identify the requested
interest-data, it cannot assign this data to a specific user.

2. Direct: The user requests the interest-data and additional redundant data. There- 
fore, the interest-data is hidden in the overall data stream. An efficient technique 
for this purpose has been suggested by two different groups (see [9, 11]). 
All work we know on privacy-enhancement in the WWW environment focuses on
the first method, i.e. using intermediary stations like MIXes [5, 14, 27, 25, etc.]. In 
this paper we investigate the second approach, using Private Information Retrieval 
(PIR), and compare our results. Our main reasons for choosing PIR include: 

- Security: PIR provides perfect security (the technique itself is mostly comparable
to the technique of DC-networks), whereas the methods of previous work provide at
most probabilistic security [23].

- Trust model (resp. attacker model): The MIX technique assumes that neither all of
the people using the anonymity technique ((n-1)-attack¹) nor all of the MIXes in
use are corrupt. PIR just assumes that the servers in use are not all corrupt².

Of course, a direct comparison of PIR and MIX is not possible. PIR provides only 
a “reading” function, while MIXes can both send (write) and receive (read) messages. 
Therefore, we will only compare the reading function of the MIX with PIR. If we
wanted both functions, we would have to combine techniques.

Our work also differs from the classical PIR approach. To date, all works about 
PIR consider unstructured data (“flat” data) [1, 2, 3, 4, 6, 9, 10, 11, 13, 22, 30]. In our
work, we investigate the application of PIR to the WWW environment, where the 
data structure can be modeled as a graph (structured data). 




Fig. 1. Basic concept of Private Information Retrieval and message service [9, 11]. 

In chapter 2 we briefly describe the PIR concept. Chapter 3 discusses the problems 
of applying PIR to the WWW. In chapter 4 we extend PIR for hierarchical data 
structures. Chapter 5 discusses network architecture for Web-PIR. Finally, chapter 6 
concludes this work. 



2 Private Information Retrieval (PIR) 

¹ We do not know if there is a technique providing security against the (n - 1)-attack (see also
[18, 26]).

² If perfect security is not envisaged then PIR does not need to trust the server [3, 6, 20].

Private Information Retrieval [9], also known as message service [11], assures that an
unbounded attacker is not able to discover the information that a user has requested
(perfect security). The goal is to read exactly one datum that is stored in a memory
cell of a server. To protect the privacy of the user, PIR accesses not just the single
memory cell, but several memory cells on several replicated servers. If not all of the
servers are corrupt, then the method provides perfect security. The method consists of
three steps:

1. The client generates mostly random vectors containing “0” and “1” and sends 
them end-to-end-encrypted to the appropriate servers. 

2. All servers read the requested cells and superpose (XOR addition) them to one 
cell and send them end-to-end-encrypted to the user. 

3. The user superposes all the received cells to one cell. 

Figure 1 describes the idea of the method. If a customer wants to read a data item
and if there are k PIR servers, he creates k - 1 random vectors. He XORs them to
form the kth vector. In the kth vector, he flips the bit representing the desired data
item. The customer sends one of the vectors to each server. Each server retrieves the
data items corresponding to 1's in its vector, and XORs them to create one datum.
The result of the XOR sum is returned to the customer, who XORs all the results and
obtains the requested data item.
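The vector construction described above can be sketched in Python as follows. The sketch assumes equal-length byte strings as data items and at least two servers, omits the end-to-end encryption of requests and responses, and simulates all servers in one process; the function names are illustrative only.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_queries(n_items: int, k_servers: int, wanted: int):
    """k - 1 random bit vectors plus one vector that is their XOR with the wanted bit flipped."""
    vectors = [[secrets.randbelow(2) for _ in range(n_items)]
               for _ in range(k_servers - 1)]
    last = [reduce(lambda a, b: a ^ b, bits) for bits in zip(*vectors)]
    last[wanted] ^= 1
    return vectors + [last]


def server_answer(database, vector) -> bytes:
    """XOR of all items whose bit is 1 in the received vector."""
    result = bytes(len(database[0]))
    for item, bit in zip(database, vector):
        if bit:
            result = xor_bytes(result, item)
    return result


def retrieve(database, k_servers: int, wanted: int) -> bytes:
    """Client side: send one vector per server and XOR all responses."""
    queries = make_queries(len(database), k_servers, wanted)
    answers = [server_answer(database, q) for q in queries]
    return reduce(xor_bytes, answers)


if __name__ == "__main__":
    db = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
    print(retrieve(db, k_servers=3, wanted=2))
```

Every item other than the wanted one is selected an even number of times across all vectors and therefore cancels out in the final XOR.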

Since the publication of the basic method, several groups have investigated this 
technique and extended it, e.g. the efficiency of the method has been increased (see 
[1, 2, 4, 10, 13, 22, 30]). In our work we will not review these extensions, since they 
all have the same problems with structured data. Thus, the presented problem and the 
suggested solution can be combined with the other improvements. 

3 Lessons Learned: PIR and the World Wide Web 

The data structure of the Web can be modeled as a graph [19]. In this chapter we
present an attack using knowledge of the Web link structure. The attack is
successful even though each individual page request to the PIR servers is perfectly secure.

3.1 Assumptions 

Using PIR the number of requests from a particular user is still observable. The at- 
tacker knows the number of links that connect one page to another, and therefore the 
minimum number of requests a user must make to travel that path. In this section we 
will discuss how to combine these two pieces of knowledge to uncover the users’ 
interest. 

For the sake of simplicity we assume the following features: 

1) A particular WWW-Page will be requested only once by a user. 

2) A user will request a page by mistake only from the root level, i.e. he will 
not mistakenly follow a path for more than one step. 

One may argue that the assumptions are not general enough and therefore are not
fulfilled in most cases. However, a single example of insecurity is enough to show
that PIR does not provide perfect security. In fact, these two assumptions may hold
more often than one may believe:

• When using PIR, it is desirable that all users use their local cache, since the
request cost is quite high (Feature 1).

• If a user is not familiar with a Web site, then he may surf until he finds some 
interesting path (Feature 2). 

We believe that these assumptions are reasonable for the designer of a privacy 
service. 

3.2 Simple Deterministic Attack 

In the considered application environment it is clearer to speak of a user session 
rather than of visiting an abstract graph or requesting distinct pages of a Web. A user 
session can be seen as a process (function of time) starting from a starting node of a 
Web graph and reaching an arbitrary node. 



If assumptions (1) and (2) hold, then we claim that PIR is not perfectly secure
for Web browsing. Consider this example: A user requests three pages from a site with
the link structure shown in Figure 2. After the first request the attacker knows that
either page “7” or page “A” was requested, i.e. one element of the set {7, A}. The
attacker cannot distinguish further at this time. After the second request, however,
the attacker knows that the user has requested page “7” because:

- with the first request the user requested “A” and now “7”, because the first request
was mistaken; or

- with the first request the user requested “7” and now “A”, because the first request
was mistaken; or

- with the first request the user requested “7” and now “2”, thus a path has been
chosen.

After the third request the attacker knows³ that the user has requested page “7” and
page “2”, and by assumption (2) now knows the user's interest.

The example shows that a technique providing perfect protection by reading
unstructured data items does not automatically lead to a technique providing perfect
protection in general. In the following section we generalize this result and give a
mathematical model for the attacker's information gain.

Fig. 2. Simple Web structure.

³ Algorithm: Write all possibilities in separate sets and take the intersection of all sets.
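The set-intersection reasoning of the attacker can be sketched in Python. Since the figure is not reproduced here, the link structure used below (starting pages “7” and “A”, with a single link from “7” to “2”) is an assumption inferred from the example, and the helper names sessions and certain_pages are hypothetical.

```python
from itertools import permutations


def sessions(roots, children, length):
    """All request sequences of the given length that satisfy both assumptions.

    A session is a run of distinct starting pages (all but the last requested
    by mistake) followed by a downward path without repeated pages.
    """
    found = []
    for n_roots in range(1, len(roots) + 1):
        for prefix in permutations(roots, n_roots):
            stack = [list(prefix)]
            while stack:
                path = stack.pop()
                if len(path) == length:
                    found.append(path)
                    continue
                if len(path) > length:
                    continue
                for nxt in children.get(path[-1], []):
                    if nxt not in path:
                        stack.append(path + [nxt])
    return found


def certain_pages(roots, children, length):
    """Pages contained in every session of the observed length."""
    candidates = [set(s) for s in sessions(roots, children, length)]
    return set.intersection(*candidates) if candidates else set()


if __name__ == "__main__":
    roots = ["7", "A"]                 # assumed starting pages
    children = {"7": ["2"]}            # assumed link structure
    for n in (1, 2, 3):
        print(n, sessions(roots, children, n), certain_pages(roots, children, n))
```

Under these assumptions the intersection is empty after one request and contains page “7” after two, matching the argument in the text; the attacker needs only the observed number of requests and the link structure.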



228 Dogan Kesdogan, Mark Borning, and Michael Schmeink 



3.3 Mathematical Model for the Information Gain 

A hierarchical data structure is a set of data elements $V := \{S_1, \ldots, S_n\}$ combined
with a set E of links between the data elements of V. $L_0 \subseteq V$ is defined as the set of
starting elements. Additionally, we assume that the edges are associated with transition
probabilities. Such probabilities can be obtained by analyzing the hitlogs of a site (see
[19]).

Following [29, 23] we call a request perfectly unobservable, if an attacker is not 
able to gain any information about which page is requested by observing the session. 
We assume that the attacker still can observe the number of requests. Thus the 
mathematical model follows the deterministic model (above example). 

We give two models here, one for the information gained by observing one session 
and one for that gained observing several sessions. This distinction is useful, particu- 
larly because information can be gained by observing several sessions. Such infor- 
mation can be used as a general basis for a time-frequency analysis. 



Single-Session Model 

Using the above-mentioned transition probabilities we can define a probability
distribution on the set $\mathcal{M}$ of all possible paths that a user can take. Let
$g: \mathcal{M} \to \mathrm{Nat}$, $m \mapsto g(m) = l_m$ describe the length of a path m.
The random variable M denotes the path m in $\mathcal{M}$ explored by the current user.
Let L be a random variable describing the length of the path chosen by the user. Thus, it
holds that $L = g(M)$. Furthermore, let $A_l = \{m \mid m \in \mathcal{M}, g(m) = l\}$ be
the set of all paths with length l. Any opponent can count the number of accesses, and thus
the length of a user's path. The ideal case would be that
$P(M = m \mid L = l) = P(M = m)$ holds for all m in $\mathcal{M}$.



Unfortunately, the attacker can rule out any path with a length different from the 
one observed. Thus, 

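\[ P(M = m \mid L = l) = \frac{P(M = m, g(M) = l)}{P(g(M) = l)} = \frac{P(M = m, g(M) = l)}{P(M \in A_l)} = \begin{cases} P(M = m) / P(M \in A_l), & \text{if } g(m) = l \\ 0, & \text{else} \end{cases} \]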

Let H(M) denote the uncertainty measure of M and H(M | L) the conditional
uncertainty of M given L. We are interested in the information conveyed about M by L.
Since L is totally dependent on M and g is a function of M, we obtain

\[ \begin{aligned} I(M, L) &= H(M) - H(M \mid L) = H(M) + H(L) - H(M, L) \\ &= H(M) + H(L) - H(M, g(M)) \\ &= H(M) + H(L) - H(M) \\ &= H(L). \end{aligned} \]




Nat = natural number 
With a suitable choice for the logarithmic base used in the uncertainty measure, it
holds that $0 \le H(L) \le 1$, where 0 denotes zero information and 1 maximum
information conveyed, respectively. Thus, H(L) is a measure for the mean gain in information
observed during one user session. To calculate H(L), we just need the probability
distribution of L, given by:

\[ P(L = l) = \sum_{m \in A_l} P(M = m). \]
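The single-session information gain can be computed with a short Python sketch. The path probabilities in the usage example are invented purely for illustration, and the function names are hypothetical; the sketch only assumes that the attacker learns the path length, so the gain equals the entropy of the length distribution.

```python
import math
from collections import defaultdict


def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution (zero terms are skipped)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)


def length_distribution(path_probs):
    """P(L = l): sum of the probabilities of all paths of length l."""
    dist = defaultdict(float)
    for path, p in path_probs.items():
        dist[len(path)] += p
    return dict(dist)


def information_gain(path_probs, base=2.0):
    """Mean information about the path revealed by its length, i.e. H(L)."""
    return entropy(length_distribution(path_probs).values(), base)


if __name__ == "__main__":
    # Hypothetical path probabilities, for illustration only.
    paths = {
        ("7",): 0.2,
        ("A",): 0.3,
        ("7", "2"): 0.3,
        ("A", "7"): 0.1,
        ("A", "7", "2"): 0.1,
    }
    print(length_distribution(paths))
    # Using the number of distinct lengths as the base keeps H(L) within [0, 1].
    print(information_gain(paths, base=3))
```

Choosing the logarithm base equal to the number of distinct path lengths is one way to obtain the normalization to the interval from 0 to 1 mentioned in the text.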



Multiple-Session Model 

If users are not anonymous and reveal their IP address, then opponents can gain in- 
formation not only during one user session but also by observing a sequence of ses- 
sions of a single user. 

For example, we can assume that the attacker can distinguish reasonable and
unreasonable paths, and that a user will not choose unreasonable paths repeatedly. More
generally, let $(M_1, \ldots, M_n)$ be a random vector denoting the paths chosen by a user
over n sessions. An opponent will observe a random vector of path lengths
$(L_1, \ldots, L_n) = (g(M_1), \ldots, g(M_n))$.

Similarly to the one-dimensional case, the information conveyed about
$(M_1, \ldots, M_n)$ by $(L_1, \ldots, L_n)$ can be simplified to:

\[ I((M_1, \ldots, M_n), (L_1, \ldots, L_n)) = H(M_1, \ldots, M_n) - H(M_1, \ldots, M_n \mid L_1, \ldots, L_n) = H(L_1, \ldots, L_n) \]

Again, $H(L_1, \ldots, L_n)$ is a measure for the mean gain in information observed during a
sequence of n sessions. Admittedly, the estimation of the probability distributions of
$(M_1, \ldots, M_n)$ and $(L_1, \ldots, L_n)$, respectively, would lead to some practical problems.



3.4 Resulting Consequences for PIR 

The direct application of PIR to structured data is unhelpful for security and for
performance. Normally, PIR requests data among the whole data space. Thus, on average,
50% of the whole database is requested in each random vector. But, if the
assumptions of Section 3.1 hold, the attacker can observe the request hierarchy and
exclude some of the requested pages as irrelevant because they are unrelated. Thus,
there is no security gain in building a random vector over the whole data set. It is
much more effective from the performance point of view (without losing any security
strength) to build the vector explicitly over the particular hierarchy.

We can conclude that applying PIR on structured data reduces the anonymity set 
(the subset of the entire data space which the attacker must consider as possible alter- 
natives) with the following consequences: 
- To provide perfect security it has to be ensured that in each hierarchy level the size
of the anonymity set is greater than 1, i.e. the Web graph must be re-structured.

- From the performance point of view the random vectors should be limited to the 
elements of the anonymity set belonging to the hierarchy level. 



4 Extensions to PIR 

As described in the last section, accessing web pages in the WWW can be viewed as
a user session on a hierarchical data structure. It was shown that an observer is able to
determine the number of accesses, i.e. the length of the user session. If he also knows
the data structure, the observer is able to exclude data items that are not reachable in
this number of accesses. In this section, we will first present a method for creating
hierarchy layers. Then we will describe a method to efficiently load data items from any
hierarchy layer. The latter method is called hierarchical blinded read.

4.1 Creation of Hierarchy Layers 

In section 3.1, we stated two assumptions defining the access to simple web struc- 
tures. Within these requirements, the creation of hierarchy layers is more difficult, 
because a user could request (n - 1) incorrect items if there are n starting items. 
Therefore, the hierarchy layers are enlarged as shown in the example in section 3.1. 

In Fig. 3, there are four hierarchy layers. To determine these layers, we partition
the web graph into three sets $L_0 := \{1, 2\}$, $L_1 := \{3, 4, 5, 6\}$ and
$L_2 := \{7, 8, 9, 10, 11\}$.

The first hierarchy layer will be the set $L_0$, because a user is only able to access
items of this set. The second layer will be $L_0 \cup L_1$: The user may have incorrectly
accessed an item of the set $L_0$ and is now accessing the other item; or the user has
accessed an item of the set $L_0$ and is now accessing an item of the set $L_1$. The third
layer will be $L_1 \cup L_2$, because the user has either accessed an item of the set $L_1$ in the
last step and is now accessing an item of the set $L_2$, or he has accessed an item of the
set $L_0$ and is now accessing an item of the set $L_1$. The remaining hierarchy layer is the
set $L_2$. An example of a three-step path is the sequence (1, 4, 10); a four-step path is
the sequence (1, 2, 5, 9).
In general terms: Let there be a hierarchical data structure with levels
$L_0, \ldots, L_{k-1}$, and let $e \in L_i$ if e is reachable over a path of length i from
the starting set. If $|L_0| = n$, then a hierarchy layer $A_i$, $i = 1, \ldots, k + n - 1$,
is defined as

\[ A_1 := L_0, \qquad A_i := \bigcup_{j = i - n}^{i - 1} L_j . \]

For j < 0 and j ≥ k, let $L_j = \emptyset$. Therefore, a sufficient condition for perfect
secrecy is $|A_i| > 1$.
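The construction of the layers can be sketched in Python. The sketch assumes 0-based levels, where levels[j] holds the items reachable over a path of length j, and uses the rule suggested by the Fig. 3 example: with n starting items, the i-th request may hit any of the n levels ending at level i - 1. The function name hierarchy_layers is illustrative.

```python
def hierarchy_layers(levels):
    """Return the hierarchy layers [A_1, A_2, ...] for the given levels.

    levels[j] is the set of items reachable over a path of length j;
    n = len(levels[0]) is the number of starting items, so up to n - 1
    requests of a session may be mistaken starting-level requests.
    """
    n = len(levels[0])
    k = len(levels)
    layers = []
    for i in range(1, k + n):              # i = 1 .. k + n - 1
        layer = set()
        for j in range(i - n, i):          # levels i - n .. i - 1
            if 0 <= j < k:                 # levels outside 0 .. k - 1 are empty
                layer |= levels[j]
        layers.append(layer)
    return layers


if __name__ == "__main__":
    # The three sets of the Fig. 3 example.
    levels = [{1, 2}, {3, 4, 5, 6}, {7, 8, 9, 10, 11}]
    for number, layer in enumerate(hierarchy_layers(levels), start=1):
        print(number, sorted(layer))
```

For the three sets of the example this yields four layers, the same number as described for Fig. 3.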

With the creation of hierarchy layers, we are able to define an algorithm to access 
the layers. To identify an item in a layer, the identifiers of the items are lexicographi- 
cally ordered and numbered consecutively. For example, the Uniform Resource Iden- 
tifier (URI) is the item’s identifier in a given web structure. The URIs can be ar- 
ranged in a lexicographical order. 

4.2 Hierarchical Blinded Read 

Having introduced a method to create hierarchy layers in hierarchical data structures 
that respect the requirements of section 3, we will define an algorithm in this section 
that realizes a blinded read access to these hierarchy layers. The algorithm is called 
hierarchical blinded read (HBR). The HBR algorithm is divided into two parts, the 
client algorithm and the server algorithm. The most important change is the introduc- 
tion of a new parameter into the original algorithm, the hierarchy layer. 

On the client side, the algorithm takes as input the layer l of the target item and the
offset o of that item in its layer. The output is the item itself. The parameter l is also
needed on the server, so the client has to send this parameter. Therefore, we extend
the request vector by appending the layer l.

Let $n_l = |L_l|$ be the number of items in hierarchy layer $L_l$, k the number of servers, l
the queried hierarchy layer, and o the offset of the target item in this layer. To
simplify the calculation, we will not use binary vectors. Instead, we represent the binary
vector as a natural number, where the highest order bit represents the item $n_l$ and the
lowest order bit represents the item 1 of the hierarchy layer $L_l$. In Fig. 3, the number
1 (= 2^0) represents web page “1” of layer 2, and the number 16 (= 2^4) represents web
page “5” of the same layer.

The client algorithm has two phases, the request phase and the receive phase. In
Fig. 4, the request phase of the client algorithm is shown. First, the client creates the
random request vectors and randomly chooses two servers. Server y receives the
XOR sum, and server w receives its value with the bit representing the user's
actual request flipped. For the transmission, two functions are used. CreatePacket creates an
encrypted packet p that contains the request value and the hierarchy layer l.
SendPacket transmits packet p to server i.

The receiving phase (Fig. 5) uses three additional functions. ReceivePacket gets a
packet and stores it in r. ExtractServer gets the server’s identification number from
the response r, and ExtractData gets the response data from r. Vector br is used to
check whether all responses have been received, and, in addition, br can be used to
check whether responses are sent more than once.

Input: k - number of servers,
       l - hierarchy layer,
       n - items in hierarchy layer,
       o - offset of wanted item in hierarchy layer

y := random(k)
w := random(k)
x[y] := 0
For i = 0 to k-1
    If i ≠ y
        x[i] := random(2^n)
        x[y] := x[y] ⊕ x[i]
x[w] := x[w] ⊕ 2^(o-1)
For i = 0 to k-1
    p := CreatePacket(l, x[i])
    SendPacket(i, p)

Fig. 4. Request phase of the client algorithm. 
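
As an illustration, the request phase can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors’ implementation: the function name create_requests is hypothetical, random(k) is read as a uniform choice from 0 to k-1, random(2^n) as a uniform n-bit value, and packet creation, encryption and transmission are left out.

```python
import secrets
from functools import reduce
from operator import xor


def create_requests(k, n, o):
    """Return k request values whose XOR equals the single bit of item o.

    k: number of servers, n: items in the layer, o: 1-based offset of the target.
    """
    y = secrets.randbelow(k)             # server that receives the XOR sum
    w = secrets.randbelow(k)             # server whose value gets the flipped bit
    x = [0] * k
    for i in range(k):
        if i != y:
            x[i] = secrets.randbits(n)   # uniformly random n-bit value
            x[y] ^= x[i]
    x[w] ^= 1 << (o - 1)                 # flip the bit of the wanted item
    return x


requests = create_requests(k=3, n=6, o=5)
combined = reduce(xor, requests)         # always 2^(o-1), here 16
```

The individual request values are random, but their XOR sum always has exactly the bit of the wanted item set, which is what the receive phase relies on.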

The client algorithm is almost the same as the original blinded read in [11]. The
change in the request phase is the introduction of the hierarchy layer, because the
layer is important in creating the request vectors and has to be transmitted to the
server. In the receiving phase, the vector br ensures that at most one response per
server is included in the result.

Input: k - number of servers

d := 0
For i = 0 to k-1
    br[i] := 0
While br ≠ (1, ..., 1)
    ReceivePacket(r)
    s := ExtractServer(r)
    e := ExtractData(r)
    If br[s] = 0
        d := d ⊕ e
        br[s] := 1
return d

Fig. 5. Receive phase of the client algorithm. 
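
A possible Python rendering of the receive phase is sketched below. It assumes that responses arrive as (server id, data) pairs and that all data blocks are bytes objects of equal length; the function name combine_responses and the error raised on missing responses are additions made for this sketch.

```python
def combine_responses(k, responses):
    """XOR the first response of each of the k servers; ignore repeated responses.

    responses: iterable of (server_id, data) pairs, data being equal-length bytes.
    """
    br = [False] * k                     # br[s] is True once server s has answered
    d = None
    for s, e in responses:
        if not br[s]:
            d = e if d is None else bytes(a ^ b for a, b in zip(d, e))
            br[s] = True
        if all(br):
            return d
    raise ValueError("not all servers responded")
```

A second response from a server that has already answered is skipped, so it cannot cancel out or distort the XOR sum.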

When a server receives a query, it loads the corresponding items, creates their XOR
sum, and sends it as a response packet to the client. Additionally, the response packet
contains the unique identification number of the server.

The server algorithm (Fig. 6) uses some additional functions. ReceivePacket waits
for a new packet and stores it in p; ExtractLayer gets the hierarchy layer l from the
packet p; ExtractRequest gets the request value x from p; and GetLayerSize determines
the layer’s number of items n. After that, the corresponding items are loaded,
i.e. the server checks which bits are set in the request value. After creating the XOR
sum, CreatePacket is used to create the response packet and SendResponse transmits
it to the client.

Input: s - identification number of the server

p := ReceivePacket
l := ExtractLayer(p)
x := ExtractRequest(p)
n := GetLayerSize(l)
e := 0
For i = 1 to n
    If (x AND 2^(i-1)) ≠ 0
        e := e ⊕ d[l, i]
r := CreatePacket(s, e)
SendResponse(r)

Fig. 6. Server algorithm.
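
The core of the server algorithm, i.e. the selection and XOR combination of the items, might look as follows in Python. The data layout is an assumption of this sketch: the hypothetical store maps each layer number to a list of equal-length bytes items, and packet handling is omitted.

```python
def answer_query(store, l, x):
    """Return the XOR of all items of layer l whose bits are set in x.

    store: dict mapping a layer number to a list of equal-length bytes items,
    where list index i - 1 holds the item with offset i.
    """
    items = store[l]
    n = len(items)                       # corresponds to GetLayerSize(l)
    e = bytes(len(items[0]))             # all-zero block, i.e. e := 0
    for i in range(1, n + 1):
        if x & (1 << (i - 1)):           # bit i-1 set: item i is requested
            e = bytes(a ^ b for a, b in zip(e, items[i - 1]))
    return e


# Hypothetical layer 2 with six one-byte "pages".
store = {2: [b"\x01", b"\x02", b"\x04", b"\x08", b"\x10", b"\x20"]}
response = answer_query(store, l=2, x=25)   # bits set for items 1, 4 and 5
```

In practice the items of a layer differ in size, so some padding to a common length would be needed before they can be combined; the sketch simply assumes this has been done.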

The server algorithm differs from the original algorithm in [11], because the items
are accessed differently, requiring the server to get the hierarchy layer with a request.
Furthermore, the server transmits a unique identification number, because the client
has to assign the responses to the various servers.

Fig. 7. Request phase of the client



Fig. 7 shows an example execution of the request phase. Web page “5” of the Fig. 
is requested, i.e. the input for the algorithm is l = 2, o = 5. There are k = 3 servers and
the number of pages in hierarchy layer 2 is n_2 = 6. In the loop, only the result of
every run is listed. Furthermore, the request values x1, x2 and x3 are presented as a
vector of length 3.

Fig. 8 shows the processing of an access at the first server. The server gets a request
vector containing l = 2 and x = 25. The binary representation of x is 11001, i.e.
the server has to load the pages “1”, “4” and “5”. The server creates the XOR sum of
them, creates a response packet containing the sum and its identification number, and
transmits the packet to the client.

In Fig. 9, the loading process of page “5” is shown. The client generates three re- 
quests and transmits them to the servers. The servers access their data repository to 
load the requested pages, create the XOR sum, and transmit the resulting items. The 
client combines the responses using the XOR function and gets the target page “5”. 
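
The complete loading process can be simulated in a few lines of Python. This is a sketch under simplifying assumptions: all servers hold the same list of pages, the pages are padded to a common length, network transfer and encryption are omitted, and all function names and page contents are hypothetical.

```python
import secrets


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length blocks."""
    return bytes(p ^ q for p, q in zip(a, b))


def create_requests(k, n, o):
    """k request values whose XOR has only the bit of item o set."""
    y, w = secrets.randbelow(k), secrets.randbelow(k)
    x = [0] * k
    for i in range(k):
        if i != y:
            x[i] = secrets.randbits(n)
            x[y] ^= x[i]
    x[w] ^= 1 << (o - 1)
    return x


def answer_query(items, x):
    """XOR of all items whose bits are set in the request value x."""
    e = bytes(len(items[0]))
    for i, item in enumerate(items, start=1):
        if x & (1 << (i - 1)):
            e = xor_bytes(e, item)
    return e


def blinded_read(items, k, o):
    """Simulate one blinded read of item o against k servers holding `items`."""
    result = bytes(len(items[0]))
    for x in create_requests(k, len(items), o):
        result = xor_bytes(result, answer_query(items, x))
    return result


# Hypothetical layer with six pages, padded with spaces to eight bytes each.
pages = [f"page {i}".encode().ljust(8) for i in range(1, 7)]
page5 = blinded_read(pages, k=3, o=5)
```

Every item other than the target appears in an even number of responses and cancels out, while the target appears an odd number of times, so the combined result is the requested page.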

4.3 Remarks

The HBR algorithm requires an additional parameter as input, the hierarchy layer.
Alternatively, the HBR algorithm could calculate this parameter itself. Moreover,
the parameter need not be transmitted at all: If the server is able to identify a client, it
can determine the number of accesses by this client and calculate the hierarchy layer
itself. The administrative effort of such an approach is very high, however, because the
server would have to store the number of accesses of every identified client within an
interval. This interval has to be specified so that exactly one user session falls
into it. Furthermore, the number of clients could be very large, and all of them would
have to be stored. If we calculate the hierarchy layer in the client’s HBR algorithm, we
have to determine the interval of a user session, too. For these reasons, we have chosen
an implementation in which the user himself determines the hierarchy layer.

5 Blinded Read Networks

Loading a data item using blinded read requires several independent servers. Together,
these servers form the so-called Blinded Read Network. A client has to access all servers
to receive a data item. In the previous sections, we described the client and server
interaction. In this section, we will describe an architecture that could be used for the
blinded read.



5.1 Closed Architecture 

The closed architecture is almost the same as the static approach of the original
blinded read. Every server contains all data items that can be accessed over the
blinded read network. An application area of the closed architecture is the World
Wide Web. The application area covers databases that contain slowly changing data,
e.g. a news archive or product catalogues of HiFi components. A change to the
database, i.e. the modification, deletion, or insertion of an item, has to be propagated
to every server. Therefore, it is necessary to define synchronization times at which
the database is switched to the new version.

The structure of a blinded read server of the closed architecture is shown in Fig.
10. The blinded read server has an interface any client can connect to. In its trusted
area, a database contains the available data, i.e. the database is stored on the same
computer as the blinded read server or it can be accessed very efficiently over a local
network.




A client needs an address table to access the data items. The blinded read server 
will create this table and will transmit it to any client who requests it. The table fur- 
ther contains the synchronization times, so the client knows how long this table will 
be valid, and when he must request a new table. 

The advantage of the closed architecture is the direct and fast access to the data
items. Furthermore, the architecture has the same security properties as the blinded
read, so no additional protection is necessary. However, a closed architecture has
limited resources, i.e. limited storage, and the synchronization requirement makes it
more difficult to administer.

6 Conclusion

In this work we demonstrated that PIR can be used to protect the user’s interest. PIR
provides strong protection - and potentially perfect protection if the modifications
suggested in this work are used. Thus, Web-PIR is an alternative to the MIX-based
solutions if only the reading function is considered. The following features can be
identified when comparing the applications of both techniques (see Table 1).



Table 1. Difference between MIX and PIR.

Method  Protection                     Client                     Server           Trust model
MIX     Complexity-theoretic approach  Organization of n clients  No modification  N MIX stations and n clients
PIR     Perfect                        Spontaneous communication  New design       N servers



As shown in the beginning of this work, the MIX technique provides at most
complexity-theoretic protection. All known implementations fall far short of this
protection goal. Another drawback is the trust model: The more trust is required by the
system, the more attack possibilities will arise. The MIX technique has to trust n
other clients and N servers. In particular, the organization of n other clients is a critical
part here (for this function a global public key infrastructure is needed) if the maximum
protection strength is envisaged. All of the deployments known to us neglect
this important point.

The great advantage of the MIX method is that it can be used for both sending and
receiving of messages. Since there is no need for a major modification of all servers,
potentially all servers in the Internet could be addressed. The lack of such general
applicability is a major drawback of PIR, since the server has to be redesigned. But for
some servers providing critical services (information), PIR can provide perfect
protection of the user’s privacy.